HFEPX Benchmark Hub
LongBench in cs.CL Papers
Updated from the current HFEPX corpus (Apr 27, 2026). This benchmark page groups 9 papers.
Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Multi Dim Rubric. Frequently cited benchmark: LongBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new evaluation experiments. The newest paper in this set is from Feb 9, 2026.