LLM Evaluation and Rubric Scoring Projects
Contributed to large-scale LLM training and evaluation work, assessing AI-generated text for instruction following, factual accuracy, tone, and coherence. Designed and applied detailed rubrics to score responses across a range of difficulty levels and contexts. Wrote prompts for supervised fine-tuning (SFT) datasets in Hindi and English to improve multilingual model performance. Maintained data quality through consistency checks, peer review, and adherence to precision-based QA metrics. Achieved consistently high agreement scores across batches and helped clarify the guidelines to support better model alignment.