HFEPX Hub

Web Browsing + Automatic Metrics (Last 45 Days)

Updated from current HFEPX corpus (Mar 1, 2026). 10 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Mar 1, 2026). 10 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Frequently cited benchmark: BrowseComp. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 18, 2026.

Papers: 10 Last published: Feb 18, 2026 Global RSS Tag RSS

Web BrowsingAutomatic MetricsLast 45d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (10) Replication-Ready Only (1)

High-Signal Coverage

100.0%

10 / 10 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

1 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Why This Matters (Expanded)

Why This Matters For Eval Research

20% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 100% of papers in this hub.
BrowseComp is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Notes (Expanded)

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly mixed annotation units; use this to scope replication staffing.
Stratify by benchmark (BrowseComp vs Memoryarena) before comparing methods.

Benchmark Interpretation

BrowseComp appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
Memoryarena appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 40% of hub papers (4/10); compare with a secondary metric before ranking methods.
latency is reported in 20% of hub papers (2/10); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (20% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (20% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (90% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (10% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (0% vs 35% target).

Strengths

Agentic evaluation appears in 100% of papers.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10% coverage).
Annotation unit is under-specified (0% coverage).

Suggested Next Analyses

Stratify by benchmark (BrowseComp vs Memoryarena) before comparing methods.
Track metric sensitivity by reporting both accuracy and latency.

Recommended Queries (Expanded)

Recommended Queries

Benchmark Slice: BrowseComp Metric Slice: accuracy Recent High-Signal Papers

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Feb 18, 2026 · Citations: 0 · Score: 8.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Memoryarena · Metric: Recall
Modeling Distinct Human Interaction in Web Agents
Feb 19, 2026 · Citations: 0 · Score: 6.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
Feb 13, 2026 · Citations: 0 · Score: 4.0

HF: Not reported · Eval: Automatic Metrics, Simulation Env · Benchmark: Not Reported · Metric: Accuracy
A Benchmark for Deep Information Synthesis
Feb 24, 2026 · Citations: 0 · Score: 4.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: F1
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
Feb 26, 2026 · Citations: 0 · Score: 4.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Precision
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Feb 25, 2026 · Citations: 0 · Score: 4.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks Feb 18, 2026	Yes Pairwise Preference	Automatic Metrics	Memoryarena	Recall	Not Reported
Modeling Distinct Human Interaction in Web Agents Feb 19, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Accuracy	Not Reported
BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents Feb 13, 2026	No Not Reported	Automatic Metrics , Simulation Env	Not Reported	Accuracy	Not Reported
A Benchmark for Deep Information Synthesis Feb 24, 2026	No Not Reported	Automatic Metrics	Not Reported	F1	Not Reported
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents Feb 26, 2026	No Not Reported	Automatic Metrics	Not Reported	Precision , Latency	Not Reported
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL Feb 25, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy , Task success	Not Reported
Mind the Style: Impact of Communication Style on Human-Chatbot Interaction Feb 19, 2026	No Not Reported	Automatic Metrics	Not Reported	Task success	Not Reported
The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task Feb 11, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy	Not Reported
INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection Jan 28, 2026	No Not Reported	Automatic Metrics	Not Reported	Latency	Not Reported
Onboard-Targeted Segmentation of Straylight in Space Camera Sensors Feb 24, 2026	No Not Reported	Automatic Metrics	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	MemoryArena: Benchmarking Agent Memory in Interdepe…	Modeling Distinct Human Interaction in Web Agents	BrowseComp-$V^3$: A Visual, Vertical, and Verifiabl…
Human Feedback	Pairwise Preference	Pairwise Preference	Not reported
Evaluation Modes	Automatic Metrics	Automatic Metrics	Automatic Metrics, Simulation Env
Benchmarks	Memoryarena	Not reported	Not reported
Metrics	Recall	Accuracy	Accuracy
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Domain Experts
Annotation Unit	Unknown	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (2)

Evaluation Modes

Automatic Metrics (10)
Simulation Env (1)

Top Benchmarks

BrowseComp (1)
Memoryarena (1)

Top Metrics

Accuracy (4)
Latency (2)
Task success (2)
F1 (1)

Rater Population Mix

Domain Experts (1)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 20.0% · benchmarks 10.0% · metrics 90.0% · quality controls 0.0%.

Top Papers

MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen · Feb 18, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Web Browsing

Existing evaluations of agents with memory typically assess memorization and action in isolation.
BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Shan · Feb 13, 2026 · Citations: 0

Automatic MetricsSimulation Env Web Browsing

Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments.
Modeling Distinct Human Interaction in Web Agents
Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou · Feb 19, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Web Browsing

In this work, we introduce the task of modeling human intervention to support collaborative web task execution.
A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026 · Citations: 0

Automatic Metrics Tool Use

To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights.
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao · Feb 26, 2026 · Citations: 0

Automatic Metrics Web Browsing

Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories.
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
Mind the Style: Impact of Communication Style on Human-Chatbot Interaction
Erik Derner, Dalibor Kučera, Aditya Gulati, Ayoub Bagheri, Nuria Oliver · Feb 19, 2026 · Citations: 0

Automatic Metrics Web Browsing

Conversational agents increasingly mediate everyday digital interactions, yet the effects of their communication style on user experience and task success remain unclear.
The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task
Rui Cao, Zhenyun Deng, Yulong Chen, Michael Schlichtkrull, Andreas Vlachos · Feb 11, 2026 · Citations: 0

Automatic Metrics Web Browsing

The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455.
INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection
Shubham Kulkarni, Alexander Lyzhov, Preetam Joshi, Shiva Chaitanya · Jan 28, 2026 · Citations: 0

Automatic Metrics Web Browsing

We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification.
Onboard-Targeted Segmentation of Straylight in Space Camera Sensors
Riccardo Gallon, Fabian Schiemenz, Alessandra Menicucci, Eberhard Gill · Feb 24, 2026 · Citations: 0

Automatic Metrics Web Browsing

This study details an artificial intelligence (AI)-based methodology for the semantic segmentation of space camera faults.

Related Hubs

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote