RWS Multi-Image LLM Evaluation & Annotation
Completed RWS multi-image LLM evaluation tasks that required reasoning across multiple related images paired with text prompts. Assessed whether model responses correctly interpreted visual content across image sets, including objects, attributes, relationships, comparisons, and contextual details. Identified hallucinations, missed visual elements, incorrect associations between images, and flawed multi-step reasoning. Ranked multiple responses, flagged visual and logical errors, and applied strict evaluation rubrics to check that responses followed the prompt instructions. Maintained high accuracy, consistency, and attention to detail on complex multimodal tasks with tightly specified guidelines.