AI Conversation Evaluator (Project-Based)
I conducted realistic, goal-oriented multi-turn dialogues with large language models to assess response quality and accuracy. My work involved simulating diverse user personas and emotional states to stress-test the models' conversational abilities. I followed structured QA rubrics and consistently documented model failure modes.
• Focused on instruction-following and narrative coherence during dialogue evaluations.
• Identified hallucinations, inconsistencies, and emotional tone issues in model outputs.
• Applied strict annotation guidelines to produce high-quality, reproducible results.
• Used Remotasks to document and manage all conversation data.