LLM Output Evaluation & Semantic Labeling for Creator Content
Led human-in-the-loop evaluation and labeling of large language model outputs for creator-focused social media content. Defined and applied semantic labeling criteria to assess alignment, intent preservation, tone consistency, and contextual correctness across generated text outputs. Performed qualitative review of model responses, identified common failure modes, and flagged ambiguous or borderline cases requiring guideline refinement. Labeled and scored hundreds of examples using structured rubrics to ensure consistency across annotations. Conducted spot checks and self-audits to maintain high labeling accuracy and reduce subjectivity. Supported iterative model development by providing feedback on mislabeled edge cases and proposing improvements to evaluation guidelines. Focused on producing high-quality, reliable labels suitable for downstream model training and evaluation.
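The rubric-based workflow described above can be sketched in code. This is a minimal illustration, not the actual tooling: the criterion names mirror those listed in the description, but the 1-5 scale, the borderline threshold, and the function names (`score_example`, `agreement_rate`) are all assumptions made for the example.

```python
from dataclasses import dataclass

# Criteria taken from the description above; the scale is an assumed 1-5.
CRITERIA = ["alignment", "intent_preservation", "tone_consistency", "contextual_correctness"]

@dataclass
class Label:
    example_id: str
    scores: dict          # criterion -> int score on the assumed 1-5 scale
    borderline: bool = False  # flagged for guideline refinement

def score_example(example_id, scores, borderline_threshold=3):
    """Apply the rubric; flag the example as borderline if any criterion
    scores at or below the (assumed) threshold."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing rubric criteria: {missing}")
    borderline = any(scores[c] <= borderline_threshold for c in CRITERIA)
    return Label(example_id, dict(scores), borderline)

def agreement_rate(pass_a, pass_b):
    """Simple self-audit: fraction of shared examples where two labeling
    passes agree exactly on every criterion."""
    by_id = {label.example_id: label for label in pass_b}
    shared = [label for label in pass_a if label.example_id in by_id]
    if not shared:
        return 0.0
    agree = sum(label.scores == by_id[label.example_id].scores for label in shared)
    return agree / len(shared)
```

A spot check would then compare a fresh labeling pass against the original with `agreement_rate` and route any borderline-flagged examples back into guideline refinement.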