HFEPX Hub

Web Browsing Papers (Last 30 Days)

Updated from current HFEPX corpus (Mar 1, 2026). 15 papers are grouped in this hub page.


Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Adjudication. Frequently cited benchmark: BrowseComp. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 18, 2026.

Papers: 15 · Last published: Feb 18, 2026


Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.


High-Signal Coverage

100.0%

15 / 15 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

1 paper is replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
1 paper reports explicit quality controls (calibration/adjudication/IAA).
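
The three counts above follow from simple predicates over per-paper protocol records. The following is a minimal sketch only: the hub's actual schema is not shown on this page, so the dictionary keys (`benchmarks`, `metrics`, `eval_modes`, `quality_controls`) and the sample records are hypothetical. Only the `human_eval` and `llm_as_judge` labels come from the text above.

```python
def is_replication_ready(paper):
    """True when benchmark, metric, and evaluation mode are all explicitly present."""
    return bool(paper["benchmarks"]) and bool(paper["metrics"]) and bool(paper["eval_modes"])


def supports_judge_human_comparison(paper):
    """True when the paper contains both `human_eval` and `llm_as_judge`."""
    return {"human_eval", "llm_as_judge"} <= set(paper["eval_modes"])


def reports_quality_controls(paper):
    """True when any quality control (calibration/adjudication/IAA) is listed."""
    return bool(paper["quality_controls"])


# Illustrative records only (hypothetical; not taken from the HFEPX corpus).
papers = [
    {"benchmarks": ["bench_a"], "metrics": ["recall"],
     "eval_modes": ["automatic_metrics"], "quality_controls": []},
    {"benchmarks": ["bench_b"], "metrics": [],
     "eval_modes": ["llm_as_judge"], "quality_controls": ["adjudication"]},
    {"benchmarks": [], "metrics": ["accuracy"],
     "eval_modes": ["automatic_metrics"], "quality_controls": []},
]

summary = {
    "replication_ready": sum(is_replication_ready(p) for p in papers),
    "judge_human_comparable": sum(supports_judge_human_comparison(p) for p in papers),
    "quality_controls": sum(reports_quality_controls(p) for p in papers),
}
print(summary)
```

Keeping each criterion as its own predicate makes it easy to tighten one definition (for example, requiring a named benchmark rather than any benchmark) without touching the others.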

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.


Why This Matters For Eval Research

15.4% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 60% of papers in this hub.
BrowseComp is a recurring benchmark anchor for cross-paper comparisons in this page.


Protocol Takeaways

Most common quality-control signal is adjudication (6.7% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Benchmark Interpretation

BrowseComp appears in 6.7% of hub papers (1/15); use this cohort for benchmark-matched comparisons.
InnoEval appears in 6.7% of hub papers (1/15); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 26.7% of hub papers (4/15); compare with a secondary metric before ranking methods.
Task success is reported in 13.3% of hub papers (2/15); compare with a secondary metric before ranking methods.


Researcher Checklist

Gap: Papers with explicit human feedback
Coverage is a replication risk (15.4% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (7.7% vs 30% target).

Moderate: Papers naming benchmarks/datasets
Coverage is usable but incomplete (30.8% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (61.5% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (15.4% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (15.4% vs 35% target).
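
The checklist labels each item by comparing observed coverage against a target. The page does not state the cutoffs it uses, so the rule below is an assumption chosen only because it reproduces the six labels shown: coverage at or above the target is "Strong", at least 75% of the target is "Moderate", and anything lower is "Gap". The function name `coverage_band` and the `moderate_ratio` parameter are hypothetical.

```python
def coverage_band(coverage_pct, target_pct, moderate_ratio=0.75):
    """Label coverage relative to a target.

    The 0.75 cutoff is an assumed value, not one documented by the hub.
    """
    ratio = coverage_pct / target_pct
    if ratio >= 1.0:
        return "Strong"
    if ratio >= moderate_ratio:
        return "Moderate"
    return "Gap"


# (item, coverage %, target %) as listed in the checklist above.
checklist = [
    ("Papers with explicit human feedback", 15.4, 45.0),
    ("Papers reporting quality controls", 7.7, 30.0),
    ("Papers naming benchmarks/datasets", 30.8, 35.0),
    ("Papers naming evaluation metrics", 61.5, 35.0),
    ("Papers with known rater population", 15.4, 35.0),
    ("Papers with known annotation unit", 15.4, 35.0),
]

bands = {name: coverage_band(cov, target) for name, cov, target in checklist}
for name, band in bands.items():
    print(f"{band}: {name}")
```

A ratio-based rule keeps the labels comparable across items whose targets differ (45% for human feedback versus 30% for quality controls).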

Strengths

Agentic evaluation appears in 100% of papers.

Known Gaps

Only 7.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (15.4% coverage).
Annotation unit is under-specified (15.4% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (BrowseComp vs Innoeval) before comparing methods.
Track metric sensitivity by reporting both accuracy and task success.
Add inter-annotator agreement checks when reproducing these protocols.
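
One common inter-annotator agreement check is Cohen's kappa between two label sequences, which also applies when comparing a judge model's verdicts against a human rater's. The sketch below is a plain-Python illustration and is not a method taken from any paper in this hub; the label values and the two example sequences are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical trajectory-level verdicts from a human rater and a judge model.
human = ["success", "success", "fail", "fail", "success", "fail"]
judge = ["success", "fail", "fail", "fail", "success", "success"]

kappa = cohens_kappa(human, judge)
print(round(kappa, 3))  # 0.333
```

Here the raters agree on 4 of 6 items (0.667 observed) against 0.5 expected by chance, so kappa is about 0.333. Raw percent agreement alone would overstate how well the two sources match.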


Recommended Queries

LLM-as-Judge Protocols
Benchmark Slice: BrowseComp
Metric Slice: accuracy
Recent High-Signal Papers

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Feb 18, 2026 · Citations: 0 · Score: 8.0
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: MemoryArena · Metric: Recall

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem
Feb 16, 2026 · Citations: 0 · Score: 6.0
HF: Not Reported · Eval: LLM-as-Judge · Benchmark: InnoEval · Metric: Not Reported

Modeling Distinct Human Interaction in Web Agents
Feb 19, 2026 · Citations: 0 · Score: 6.0
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Feb 15, 2026 · Citations: 0 · Score: 4.5
HF: Not Reported · Eval: Simulation Env · Benchmark: WebArena · Metric: Not Reported

BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
Feb 13, 2026 · Citations: 0 · Score: 4.0
HF: Not Reported · Eval: Automatic Metrics, Simulation Env · Benchmark: Not Reported · Metric: Accuracy

A Benchmark for Deep Information Synthesis
Feb 24, 2026 · Citations: 0 · Score: 4.0
HF: Not Reported · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: F1

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper (Date)	HF Signal	Eval Modes	Benchmarks	Metrics	QC
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks (Feb 18, 2026)	Yes (Pairwise Preference)	Automatic Metrics	MemoryArena	Recall	Not Reported
InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem (Feb 16, 2026)	No (Not Reported)	LLM-as-Judge	InnoEval	Not Reported	Adjudication
Modeling Distinct Human Interaction in Web Agents (Feb 19, 2026)	Yes (Pairwise Preference)	Automatic Metrics	Not Reported	Accuracy	Not Reported
Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents (Feb 15, 2026)	No (Not Reported)	Simulation Env	WebArena, OSWorld	Not Reported	Not Reported
BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents (Feb 13, 2026)	No (Not Reported)	Automatic Metrics, Simulation Env	Not Reported	Accuracy	Not Reported
A Benchmark for Deep Information Synthesis (Feb 24, 2026)	No (Not Reported)	Automatic Metrics	Not Reported	F1	Not Reported
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents (Feb 26, 2026)	No (Not Reported)	Automatic Metrics	Not Reported	Precision, Latency	Not Reported
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL (Feb 25, 2026)	No (Not Reported)	Automatic Metrics	Not Reported	Accuracy, Task success	Not Reported
Mind the Style: Impact of Communication Style on Human-Chatbot Interaction (Feb 19, 2026)	No (Not Reported)	Automatic Metrics	Not Reported	Task success	Not Reported
The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task (Feb 11, 2026)	No (Not Reported)	Automatic Metrics	Not Reported	Accuracy	Not Reported
Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids (Feb 24, 2026)	No (Not Reported)	Simulation Env	Not Reported	Not Reported	Not Reported
Contextual Safety Reasoning and Grounding for Open-World Robots (Feb 23, 2026)	No (Not Reported)	Simulation Env	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	MemoryArena: Benchmarking Agent Memory in Interdepe…	InnoEval: On Research Idea Evaluation as a Knowledg…	Modeling Distinct Human Interaction in Web Agents
Human Feedback	Pairwise Preference	Not reported	Pairwise Preference
Evaluation Modes	Automatic Metrics	LLM-as-Judge	Automatic Metrics
Benchmarks	MemoryArena	InnoEval	Not reported
Metrics	Recall	Not reported	Accuracy
Quality Controls	Not reported	Adjudication	Not reported
Rater Population	Unknown	Domain Experts	Unknown
Annotation Unit	Unknown	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (2)

Evaluation Modes

Automatic Metrics (9)
Simulation Env (4)
LLM-as-Judge (1)

Top Benchmarks

BrowseComp (1)
InnoEval (1)
MemoryArena (1)
OSWorld (1)

Top Metrics

Accuracy (4)
Task success (2)
F1 (1)
Latency (1)

Rater Population Mix

Domain Experts (2)

Quality Controls

Adjudication (1)

Coverage diagnostics (sample-based): human-feedback 13.3% · benchmarks 26.7% · metrics 53.3% · quality controls 6.7%.
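
The sample-based diagnostics above are consistent with simple count-over-sample percentages for a 15-paper sample. In the sketch below, the counts for human feedback (2) and quality controls (1) match the snapshot lists on this page, while the counts for benchmarks (4) and metrics (8) are assumptions back-computed from the reported percentages, since the page does not list them directly. The helper name `coverage_pct` is hypothetical.

```python
SAMPLE_SIZE = 15  # papers grouped in this hub

# Papers reporting each signal. The benchmark and metric counts are
# inferred from the reported percentages, not stated on the page.
signal_counts = {
    "human-feedback": 2,
    "benchmarks": 4,
    "metrics": 8,
    "quality controls": 1,
}


def coverage_pct(count, total):
    """Percentage of sampled papers reporting a signal, rounded to one decimal."""
    return round(100.0 * count / total, 1)


diagnostics = {name: coverage_pct(n, SAMPLE_SIZE) for name, n in signal_counts.items()}
print(" · ".join(f"{name} {pct}%" for name, pct in diagnostics.items()))
```

Recomputing percentages from explicit counts and a stated denominator is a quick way to check which sample a given figure on the page refers to.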

Top Papers

MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen · Feb 18, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics · Web Browsing
Existing evaluations of agents with memory typically assess memorization and action in isolation.

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem
Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue · Feb 16, 2026 · Citations: 0
LLM-as-Judge · Web Browsing
The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation.

BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Shan · Feb 13, 2026 · Citations: 0
Automatic Metrics · Simulation Env · Web Browsing
Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments.

Modeling Distinct Human Interaction in Web Agents
Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou · Feb 19, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics · Web Browsing
In this work, we introduce the task of modeling human intervention to support collaborative web task execution.

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu · Feb 15, 2026 · Citations: 0
Simulation Env · Long Horizon
The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…

Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids
Victor Reijgwart, Cesar Cadena, Roland Siegwart, Lionel Ott · Feb 24, 2026 · Citations: 0
Simulation Env · Long Horizon
Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information.

Contextual Safety Reasoning and Grounding for Open-World Robots
Zachary Ravichandran, David Snyder, Alexander Robey, Hamed Hassani, Vijay Kumar · Feb 23, 2026 · Citations: 0
Simulation Env · Web Browsing
Traditional safety approaches enforce fixed constraints in user-specified contexts, limiting their ability to handle the open-ended contextual variability of real-world deployment.

A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026 · Citations: 0
Automatic Metrics · Tool Use
To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights.

Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao · Feb 26, 2026 · Citations: 0
Automatic Metrics · Web Browsing
Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories.

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026 · Citations: 0
Automatic Metrics · Long Horizon
Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.

Mind the Style: Impact of Communication Style on Human-Chatbot Interaction
Erik Derner, Dalibor Kučera, Aditya Gulati, Ayoub Bagheri, Nuria Oliver · Feb 19, 2026 · Citations: 0
Automatic Metrics · Web Browsing
Conversational agents increasingly mediate everyday digital interactions, yet the effects of their communication style on user experience and task success remain unclear.

The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task
Rui Cao, Zhenyun Deng, Yulong Chen, Michael Schlichtkrull, Andreas Vlachos · Feb 11, 2026 · Citations: 0
Automatic Metrics · Web Browsing
The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455.

Onboard-Targeted Segmentation of Straylight in Space Camera Sensors
Riccardo Gallon, Fabian Schiemenz, Alessandra Menicucci, Eberhard Gill · Feb 24, 2026 · Citations: 0
Automatic Metrics · Web Browsing
This study details an artificial intelligence (AI)-based methodology for the semantic segmentation of space camera faults.

SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang · Feb 24, 2026 · Citations: 0
Tool Use
Agentic systems increasingly rely on reusable procedural capabilities, a.k.a., agentic skills, to execute long-horizon workflows reliably.

UI-Venus-1.5 Technical Report
Venus Team, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu · Feb 9, 2026 · Citations: 0
Long Horizon
In this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world applications.
