HFEPX Benchmark Hub
MT-Bench In CS.AI Papers
Updated from the current HFEPX corpus (Mar 8, 2026). This benchmark page groups 4 papers.
Common evaluation mode: automatic metrics. Common annotation unit: pairwise comparison. Frequent quality-control step: calibration. Frequently cited benchmark: MT-Bench. Common metric signal: Elo ratings. The newest paper in this set is from Feb 13, 2026. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new evaluation experiments.
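Since the grouped papers commonly report Elo ratings derived from pairwise judge verdicts, here is a minimal sketch of that conversion. The K-factor, initial rating of 1000, and the model names are illustrative assumptions, not details taken from the papers on this page.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 for an A win, 0.5 for a tie, 0.0 for a loss.
    Returns the updated (r_a, r_b) pair."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# Hypothetical pairwise verdicts from an automatic judge: (winner, loser)
ratings = {"model_a": 1000.0, "model_b": 1000.0}
verdicts = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in verdicts:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], 1.0)
```

With two wins for model_a against one for model_b, model_a ends with the higher rating; the update order matters, which is why some leaderboards instead fit ratings over all comparisons jointly (e.g. via a Bradley-Terry model).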