HFEPX Hub

Automatic Metrics + General + Web Browsing Papers

Updated from current HFEPX corpus (Mar 8, 2026). 11 papers are grouped in this hub page.

Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Frequently cited benchmark: BrowseComp. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 18, 2026.

Papers: 11 · Last published: Feb 18, 2026

Automatic Metrics · General · Web Browsing

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage: 100.0% (11 / 11 sampled papers are not low-signal flagged).
Replication-Ready Set: 3 papers (benchmark + metric + explicit evaluation mode present).
Judge/Human Comparability: 0 papers contain both `human_eval` and `llm_as_judge`, so none support judge-vs-human agreement analysis.
Quality Controls: 0 papers report explicit quality controls (calibration/adjudication/IAA).
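
The replication-ready count can be reproduced from the protocol matrix further down this page. The following is a minimal sketch, assuming the rule stated above (benchmark + metric + explicit evaluation mode). The `Paper` record, its field names, and the shortened paper names are hypothetical illustrations, not the hub's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical record; field names are illustrative only.
    name: str
    eval_modes: list = field(default_factory=list)
    benchmarks: list = field(default_factory=list)
    metrics: list = field(default_factory=list)


def is_replication_ready(p: Paper) -> bool:
    # Assumed rule: all three protocol ingredients must be present.
    return bool(p.benchmarks) and bool(p.metrics) and bool(p.eval_modes)


# Rows transcribed from the protocol matrix on this page (names shortened).
AUTO = ["Automatic Metrics"]
PAPERS = [
    Paper("MemoryArena", AUTO, ["MemoryArena"], ["Recall"]),
    Paper("RedTeamCUA", AUTO, ["RTC-Bench"], ["Jailbreak success rate"]),
    Paper("MUSE", AUTO, [], ["Success rate", "Jailbreak success rate"]),
    Paper("Modeling Distinct Human Interaction", AUTO, [], ["Accuracy"]),
    Paper("SpatiaLab", AUTO, ["DROP"], ["Accuracy"]),
    Paper("BrowseComp-V3", AUTO + ["Simulation Env"], [], ["Accuracy"]),
    Paper("Spatio-Temporal Token Pruning", AUTO, [], ["Precision", "Latency"]),
    Paper("Mind the Style", AUTO, [], ["Task success"]),
    Paper("AVerImaTeC Shared Task", AUTO, [], ["Accuracy"]),
    Paper("CowPilot", AUTO, [], ["Success rate", "Task success"]),
    Paper("Onboard-Targeted Segmentation of Straylight", AUTO, [], []),
]

ready = [p.name for p in PAPERS if is_replication_ready(p)]
print(len(ready), ready)
```

Under these assumptions the filter selects the three papers that name a benchmark in the matrix, which matches the replication-ready count reported here.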

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Why This Matters For Eval Research

45.5% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 100% of papers in this hub.
BrowseComp is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Where reported, raters are mostly domain experts and annotation units are mixed; use this to scope replication staffing.
Stratify by benchmark (BrowseComp vs DROP) before comparing methods.

Benchmark Interpretation

BrowseComp appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.
DROP appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 36.4% of hub papers (4/11); compare with a secondary metric before ranking methods.
Jailbreak success rate is reported in 18.2% of hub papers (2/11); compare with a secondary metric before ranking methods.
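
The shares quoted in the benchmark and metric interpretation lines are simple proportions of the 11 hub papers. A minimal sketch of that arithmetic, assuming one-decimal rounding; the helper name `share` is hypothetical:

```python
TOTAL_PAPERS = 11  # hub size stated on this page


def share(count: int, total: int = TOTAL_PAPERS) -> str:
    # Format a count as "P% (count/total)" with one decimal place (assumed).
    return f"{100 * count / total:.1f}% ({count}/{total})"


# Counts taken from the metric and benchmark lines on this page.
counts = {
    "accuracy": 4,
    "jailbreak success rate": 2,
    "BrowseComp": 1,
    "DROP": 1,
}

for label, n in counts.items():
    print(f"{label}: {share(n)}")
```

With only one to four papers behind each share, a single added or removed paper moves the percentage by about nine points, which is one reason to pair each metric with a secondary one before ranking methods.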

Researcher Checklist

Strong: Papers with explicit human feedback
Coverage is strong (45.5% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (0% vs 30% target).

Strong: Papers naming benchmarks/datasets
Coverage is strong (36.4% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (90.9% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (9.1% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (0% vs 35% target).
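
The checklist labels above are consistent with a simple threshold comparison. A minimal sketch, assuming an item is "Strong" when coverage meets or exceeds its target and a "Gap" otherwise; that rule is inferred from the six rows and is not a documented HFEPX scoring method:

```python
# (item, coverage %, target %) as listed in the checklist on this page.
CHECKLIST = [
    ("Papers with explicit human feedback", 45.5, 45.0),
    ("Papers reporting quality controls", 0.0, 30.0),
    ("Papers naming benchmarks/datasets", 36.4, 35.0),
    ("Papers naming evaluation metrics", 90.9, 35.0),
    ("Papers with known rater population", 9.1, 35.0),
    ("Papers with known annotation unit", 0.0, 35.0),
]


def classify(coverage: float, target: float) -> str:
    # Assumed rule: meeting the target counts as strong coverage.
    return "Strong" if coverage >= target else "Gap"


labels = [classify(cov, tgt) for _, cov, tgt in CHECKLIST]
for (item, cov, tgt), label in zip(CHECKLIST, labels):
    print(f"{label}: {item} ({cov}% vs {tgt}% target)")
```

Note that human feedback clears its target by only half a point, so that "Strong" label would flip if one paper were reclassified.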

Strengths

Strong human-feedback signal (45.5% of papers).
Most papers provide measurable evaluation context (36.4% benchmarks, 90.9% metrics).
Agentic evaluation appears in 100% of papers.

Known Gaps

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (9.1% coverage).
Annotation unit is under-specified (0% coverage).

Suggested Next Analyses

Stratify by benchmark (BrowseComp vs DROP) before comparing methods.
Track metric sensitivity by reporting both accuracy and jailbreak success rate.
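
Stratifying before comparison can be sketched by grouping the protocol-matrix rows by benchmark and checking which metrics every paper in a stratum shares. This is an illustrative sketch only: the `stratify` helper and the shortened paper names are hypothetical, and papers with no reported benchmark are placed in their own "Not Reported" stratum rather than compared across benchmarks.

```python
from collections import defaultdict

# (paper, benchmark or None, metrics) transcribed from the protocol matrix.
ROWS = [
    ("MemoryArena", "MemoryArena", ["Recall"]),
    ("RedTeamCUA", "RTC-Bench", ["Jailbreak success rate"]),
    ("MUSE", None, ["Success rate", "Jailbreak success rate"]),
    ("Modeling Distinct Human Interaction", None, ["Accuracy"]),
    ("SpatiaLab", "DROP", ["Accuracy"]),
    ("BrowseComp-V3", None, ["Accuracy"]),
    ("Spatio-Temporal Token Pruning", None, ["Precision", "Latency"]),
    ("Mind the Style", None, ["Task success"]),
    ("AVerImaTeC Shared Task", None, ["Accuracy"]),
    ("CowPilot", None, ["Success rate", "Task success"]),
    ("Onboard-Targeted Segmentation of Straylight", None, []),
]


def stratify(rows):
    # Group papers by benchmark; unreported benchmarks share one stratum.
    strata = defaultdict(list)
    for name, benchmark, metrics in rows:
        strata[benchmark or "Not Reported"].append((name, metrics))
    return dict(strata)


strata = stratify(ROWS)
for benchmark, members in strata.items():
    # Metrics reported by every paper in the stratum are directly comparable.
    shared = set.intersection(*(set(m) for _, m in members))
    print(benchmark, [n for n, _ in members], sorted(shared))
```

In this slice every named benchmark has a single paper, so benchmark-matched comparison is not yet possible without adding papers to a stratum.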

Recommended Queries

Benchmark Slice: BrowseComp
Metric Slice: accuracy
Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Sessio…

Highest protocol score, with explicit human-feedback and evaluation signals plus the MemoryArena benchmark.

Strongest benchmark reference

RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in H…

RTC-Bench with jailbreak success rate gives a fast comparison anchor.

Strongest recent paper

MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation…

Useful for current practice scanning; published Mar 3, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Feb 18, 2026 · Citations: 0 · Score: 8.0
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: MemoryArena · Metric: Recall

RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
May 28, 2025 · Citations: 0 · Score: 6.5
HF: Red Team · Eval: Automatic Metrics · Benchmark: RTC-Bench · Metric: Jailbreak success rate

MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Mar 3, 2026 · Citations: 0 · Score: 6.0
HF: Red Team · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Success rate

Modeling Distinct Human Interaction in Web Agents
Feb 19, 2026 · Citations: 0 · Score: 6.0
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Feb 3, 2026 · Citations: 0 · Score: 5.5
HF: Not Reported · Eval: Automatic Metrics · Benchmark: DROP · Metric: Accuracy

BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
Feb 13, 2026 · Citations: 0 · Score: 4.0
HF: Not Reported · Eval: Automatic Metrics, Simulation Env · Benchmark: Not Reported · Metric: Accuracy

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	Date	HF Signal	Eval Modes	Benchmarks	Metrics	QC
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks	Feb 18, 2026	Yes (Pairwise Preference)	Automatic Metrics	MemoryArena	Recall	Not Reported
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments	May 28, 2025	Yes (Red Team)	Automatic Metrics	RTC-Bench	Jailbreak success rate	Not Reported
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models	Mar 3, 2026	Yes (Red Team)	Automatic Metrics	Not Reported	Success rate, Jailbreak success rate	Not Reported
Modeling Distinct Human Interaction in Web Agents	Feb 19, 2026	Yes (Pairwise Preference)	Automatic Metrics	Not Reported	Accuracy	Not Reported
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?	Feb 3, 2026	No	Automatic Metrics	DROP	Accuracy	Not Reported
BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents	Feb 13, 2026	No	Automatic Metrics, Simulation Env	Not Reported	Accuracy	Not Reported
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents	Feb 26, 2026	No	Automatic Metrics	Not Reported	Precision, Latency	Not Reported
Mind the Style: Impact of Communication Style on Human-Chatbot Interaction	Feb 19, 2026	No	Automatic Metrics	Not Reported	Task success	Not Reported
The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task	Feb 11, 2026	No	Automatic Metrics	Not Reported	Accuracy	Not Reported
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation	Jan 28, 2025	Yes (Pairwise Preference, Demonstrations)	Automatic Metrics	Not Reported	Success rate, Task success	Not Reported
Onboard-Targeted Segmentation of Straylight in Space Camera Sensors	Feb 24, 2026	No	Automatic Metrics	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	MemoryArena: Benchmarking Agent Memory in Interdepe…	RedTeamCUA: Realistic Adversarial Testing of Comput…	MUSE: A Run-Centric Platform for Multimodal Unified…
Human Feedback	Pairwise Preference	Red Team	Red Team
Evaluation Modes	Automatic Metrics	Automatic Metrics	Automatic Metrics
Benchmarks	MemoryArena	RTC-Bench	Not reported
Metrics	Recall	Jailbreak success rate	Success rate, Jailbreak success rate
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Unknown
Annotation Unit	Unknown	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (3)
Red Team (2)
Demonstrations (1)

Evaluation Modes

Automatic Metrics (11)
Simulation Env (1)

Top Benchmarks

BrowseComp (1)
DROP (1)
MemoryArena (1)
RTC-Bench (1)

Top Metrics

Accuracy (4)
Jailbreak success rate (2)
Success rate (2)
Task success (2)

Rater Population Mix

Domain Experts (1)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 45.5% · benchmarks 27.3% · metrics 90.9% · quality controls 0.0%.

Top Papers

MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen · Feb 18, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics · Web Browsing
Existing evaluations of agents with memory typically assess memorization and action in isolation.

RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier · May 28, 2025 · Citations: 0
Red Team · Automatic Metrics · Web Browsing
Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities.

BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Shan · Feb 13, 2026 · Citations: 0
Automatic Metrics · Simulation Env · Web Browsing
Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments.

MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen · Mar 3, 2026 · Citations: 0
Red Team · Automatic Metrics · Web Browsing
Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs.

Modeling Distinct Human Interaction in Web Agents
Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou · Feb 19, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics · Web Browsing
In this work, we introduce the task of modeling human intervention to support collaborative web task execution.

CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou · Jan 28, 2025 · Citations: 0
Pairwise Preference · Demonstrations · Automatic Metrics · Web Browsing
We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency.

SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar · Feb 3, 2026 · Citations: 0
Automatic Metrics · Web Browsing
To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts.

Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao · Feb 26, 2026 · Citations: 0
Automatic Metrics · Web Browsing
Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories.

Mind the Style: Impact of Communication Style on Human-Chatbot Interaction
Erik Derner, Dalibor Kučera, Aditya Gulati, Ayoub Bagheri, Nuria Oliver · Feb 19, 2026 · Citations: 0
Automatic Metrics · Web Browsing
Conversational agents increasingly mediate everyday digital interactions, yet the effects of their communication style on user experience and task success remain unclear.

The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task
Rui Cao, Zhenyun Deng, Yulong Chen, Michael Schlichtkrull, Andreas Vlachos · Feb 11, 2026 · Citations: 0
Automatic Metrics · Web Browsing
The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455.

Onboard-Targeted Segmentation of Straylight in Space Camera Sensors
Riccardo Gallon, Fabian Schiemenz, Alessandra Menicucci, Eberhard Gill · Feb 24, 2026 · Citations: 0
Automatic Metrics · Web Browsing
This study details an artificial intelligence (AI)-based methodology for the semantic segmentation of space camera faults.
