HFEPX Hub
General + LLM-as-Judge Papers
Updated from the current HFEPX corpus (Feb 27, 2026). This hub page groups 8 papers. Common evaluation modes: LLM-as-judge, automatic metrics. Most common rater population: domain experts. Common annotation unit: multi-dimensional rubric. Frequent quality control: inter-annotator agreement reported. Frequently cited benchmark: CapArena. Common metric signal: agreement. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 24, 2026.
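As a rough illustration of the "agreement" signal mentioned above, the sketch below computes Cohen's kappa between an LLM judge and a domain-expert rater on the same items. It is an assumption about how agreement might be measured, not the protocol of any paper in this set. The function name `cohen_kappa` and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labeled independently at their own base rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary pass/fail verdicts on six items (illustrative only).
judge_scores = [1, 0, 1, 1, 0, 1]   # LLM judge
expert_scores = [1, 0, 0, 1, 0, 1]  # domain expert
kappa = cohen_kappa(judge_scores, expert_scores)
print(f"kappa = {kappa:.3f}")
```

Kappa discounts agreement that would occur by chance, so it is usually more informative than raw percent agreement when one label dominates. For a multi-dimensional rubric, one option is to compute it separately per rubric dimension.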