HFEPX Hub
Automatic Metrics + Critique Edit + General Papers
Updated from current HFEPX corpus (Apr 12, 2026). 11 papers are grouped in this hub page.
Common evaluation mode: Automatic Metrics. Common annotation unit: Multi Dim Rubric. Frequent quality control: Inter Annotator Agreement Reported. Frequently cited benchmark: AIME. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Mar 22, 2026.