HFEPX Archive Slice
HFEPX Daily Archive: 2026-01-29
Updated from the current HFEPX corpus (Mar 8, 2026). Five papers are grouped on this daily page.
Common evaluation modes: Automatic Metrics, Simulation Env. Common annotation unit: Freeform. Frequently cited benchmark: ALFWorld. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Jan 29, 2026.