HFEPX Benchmark Hub
GSM8K Or MMLU Or AIME Benchmark Papers
Snapshot fallback from 2026-06-21. This benchmark page remains available while the live HFEPX payload refreshes; use the full paper list after the API recovers.
Use this page for benchmark-matched method comparisons and eval protocol selection. Quality band: Developing.
Analysis blocks are computed from the loaded sample (0 of 72 papers).
High-Signal Coverage
0%
0 of 0 sampled papers are free of low-signal flags.
Replication-Ready Set
0
Papers with explicit benchmark + metric + eval mode fields.
Quality Controls
0%
0 papers report calibration, adjudication, or inter-annotator agreement (IAA) controls.
Primary action: Use this page to map benchmark mentions first; wait for stronger metric/QC coverage before strict comparisons.
Gap: Human feedback
Human feedback coverage requires the live HFEPX payload.
Gap: Quality controls
Quality controls coverage requires the live HFEPX payload.
Gap: Benchmarks
Benchmarks coverage requires the live HFEPX payload.
Gap: Metrics
Metrics coverage requires the live HFEPX payload.
Gap: Known rater population
Known rater population coverage requires the live HFEPX payload.
Gap: Known annotation unit
Known annotation unit coverage requires the live HFEPX payload.
Evaluation Modes
Human Feedback Mix
Top Benchmarks
Top Metrics
No papers available for this benchmark yet.