HFEPX Benchmark Hub
APPS + Coding Benchmark Papers
Updated from the current HFEPX corpus (Mar 17, 2026). 2 papers are grouped in this benchmark page. Common evaluation modes: human evaluation, LLM-as-judge. Frequently cited benchmark: APPS. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Jan 5, 2026.
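As a rough illustration of what "judge behavior" covers, here is a minimal Python sketch of an LLM-as-judge scoring loop for APPS-style coding tasks. The prompt wording, the 0-10 score scale, the JSON reply format, and the `call_judge` callable are all assumptions for illustration; these are exactly the kinds of protocol choices that differ between the papers grouped here.

```python
import json
import re
from typing import Callable

# Illustrative judge prompt; the wording and 0-10 scale are assumptions,
# not a protocol taken from the grouped papers.
JUDGE_PROMPT = (
    "You are grading a candidate solution to a programming problem.\n\n"
    "Problem:\n{problem}\n\n"
    "Candidate solution:\n{solution}\n\n"
    'Reply with a JSON object: {{"score": <integer 0-10>, "rationale": "<short reason>"}}'
)

def judge_solutions(
    samples: list[dict],               # each item: {"problem": str, "solution": str}
    call_judge: Callable[[str], str],  # plug in your own LLM client here
) -> list[dict]:
    """Score each (problem, solution) pair with an LLM judge."""
    results = []
    for sample in samples:
        prompt = JUDGE_PROMPT.format(**sample)
        raw = call_judge(prompt)
        # Parse the first JSON object in the reply; fall back to the raw text.
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        try:
            parsed = json.loads(match.group(0)) if match else {"score": None, "rationale": raw}
        except json.JSONDecodeError:
            parsed = {"score": None, "rationale": raw}
        results.append({**sample, **parsed})
    return results
```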