HFEPX Benchmark Hub
WebArena Benchmark Papers (Last 365 Days)
Updated from the current HFEPX corpus (Apr 17, 2026). This benchmark page groups 10 papers.
Common evaluation modes: Simulation Env, Human Eval. Common annotation unit: Trajectory. Frequently cited benchmark: WebArena. Common metric signal: success rate. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Mar 22, 2026.