HFEPX Benchmark Hub
WebArena in CS.CL Papers
Updated from the current HFEPX corpus (Apr 17, 2026). 8 papers are grouped on this benchmark page. Common evaluation modes: Simulation Env, Human Eval. Common annotation unit: Trajectory. Frequently cited benchmark: WebArena. Common metric signal: success rate. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Mar 22, 2026.
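Papers in this set typically report a success rate aggregated over judged trajectories. A minimal sketch of that computation (the `Trajectory` type and `success_rate` helper are illustrative assumptions, not taken from any specific paper or from WebArena's own harness):

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One agent attempt at a task, with a judged binary outcome."""
    task_id: str
    success: bool  # outcome assigned by the judge (simulated env check or human eval)


def success_rate(trajectories: list[Trajectory]) -> float:
    """Fraction of trajectories judged successful; 0.0 for an empty set."""
    if not trajectories:
        return 0.0
    return sum(t.success for t in trajectories) / len(trajectories)


# Hypothetical example runs over four tasks.
runs = [
    Trajectory("task-1", True),
    Trajectory("task-2", False),
    Trajectory("task-3", True),
    Trajectory("task-4", True),
]
print(f"success rate: {success_rate(runs):.2f}")  # 3 of 4 succeed -> 0.75
```

Real protocols differ in how the judge assigns `success` (environment-state checks, LLM judges, or human raters), which is exactly the kind of design decision this page is meant to help compare.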