AI DevOps - LLM Evaluation and Benchmarking
Designed and specified multi-step agentic benchmark tasks to evaluate Large Language Model (LLM) performance. Wrote precise task specifications and scoring criteria to ensure comprehensive assessment of models. Analyzed model failure modes and developed solutions for improved evaluation quality.
• Developed new benchmarks for LLM assessment
• Authored clear rubrics and evaluation frameworks
• Utilized prompt engineering to guide AI responses
• Collaborated with DevOps to automate test suites and validate outcomes