HFEPX Hub

CS.LG + Web Browsing Papers

Updated from current HFEPX corpus (Mar 8, 2026). 10 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Mar 8, 2026). 10 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Adjudication. Frequently cited benchmark: DROP. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 16, 2026.

Papers: 10 Last published: Feb 16, 2026 Global RSS Tag RSS

Cs.LGWeb Browsing

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (10) Replication-Ready Only (1)

High-Signal Coverage

100.0%

10 / 10 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

1 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
1 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Why This Matters For Eval Research

37.5% of papers report explicit human-feedback signals, led by demonstration data.
automatic metrics appears in 50% of papers in this hub.
DROP is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Most common quality-control signal is adjudication (10% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Benchmark Interpretation

DROP appears in 12.5% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
Innoeval appears in 12.5% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 37.5% of hub papers (3/10); compare with a secondary metric before ranking methods.
f1 is reported in 12.5% of hub papers (1/10); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Moderate: Papers with explicit human feedback

Coverage is usable but incomplete (37.5% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (12.5% vs 30% target).
Moderate: Papers naming benchmarks/datasets

Coverage is usable but incomplete (25% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (62.5% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (12.5% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (12.5% vs 35% target).

Strengths

Agentic evaluation appears in 100% of papers.

Known Gaps

Only 12.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (12.5% coverage).
Annotation unit is under-specified (12.5% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (DROP vs Innoeval) before comparing methods.
Track metric sensitivity by reporting both accuracy and f1.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries (Expanded)

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: DROP Metric Slice: accuracy Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-…

Highest protocol score with explicit human/eval signal plus Innoeval.

Strongest benchmark reference

MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation…

Reported benchmark with success rate gives a fast comparison anchor.

Strongest recent paper

SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in th…

Useful for current practice scanning; published Feb 3, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem
Feb 16, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: Llm As Judge · Benchmark: Innoeval · Metric: Not Reported
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Mar 3, 2026 · Citations: 0 · Score: 6.0

HF: Red Team · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Success rate
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Feb 3, 2026 · Citations: 0 · Score: 5.5

HF: Not reported · Eval: Automatic Metrics · Benchmark: DROP · Metric: Accuracy
TimeWarp: Evaluating Web Agents by Revisiting the Past
Mar 5, 2026 · Citations: 0 · Score: 4.5

HF: Demonstrations · Eval: Not reported · Benchmark: Not Reported · Metric: Not Reported
A Benchmark for Deep Information Synthesis
Feb 24, 2026 · Citations: 0 · Score: 4.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: F1
Replaying pre-training data improves fine-tuning
Mar 5, 2026 · Citations: 0 · Score: 4.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem Feb 16, 2026	No Not Reported	Llm As Judge	Innoeval	Not Reported	Adjudication
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models Mar 3, 2026	Yes Red Team	Automatic Metrics	Not Reported	Success rate , Jailbreak success rate	Not Reported
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild? Feb 3, 2026	No Not Reported	Automatic Metrics	DROP	Accuracy	Not Reported
TimeWarp: Evaluating Web Agents by Revisiting the Past Mar 5, 2026	Yes Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported
A Benchmark for Deep Information Synthesis Feb 24, 2026	No Not Reported	Automatic Metrics	Not Reported	F1	Not Reported
Replaying pre-training data improves fine-tuning Mar 5, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy	Not Reported
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL Feb 25, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy , Task success	Not Reported
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation Oct 21, 2025	Yes Demonstrations	Simulation Env	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	InnoEval: On Research Idea Evaluation as a Knowledg…	MUSE: A Run-Centric Platform for Multimodal Unified…	SpatiaLab: Can Vision-Language Models Perform Spati…
Human Feedback	Not reported	Red Team	Not reported
Evaluation Modes	Llm As Judge	Automatic Metrics	Automatic Metrics
Benchmarks	Innoeval	Not reported	DROP
Metrics	Not reported	Success rate, Jailbreak success rate	Accuracy
Quality Controls	Adjudication	Not reported	Not reported
Rater Population	Domain Experts	Unknown	Unknown
Annotation Unit	Unknown	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Demonstrations (2)
Red Team (1)

Evaluation Modes

Automatic Metrics (5)
Llm As Judge (1)
Simulation Env (1)

Top Benchmarks

DROP (1)
Innoeval (1)

Top Metrics

Accuracy (3)
F1 (1)
Jailbreak success rate (1)
Success rate (1)

Rater Population Mix

Domain Experts (1)

Quality Controls

Adjudication (1)

Coverage diagnostics (sample-based): human-feedback 30.0% · benchmarks 30.0% · metrics 50.0% · quality controls 10.0%.

Top Papers

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem
Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue · Feb 16, 2026 · Citations: 0

Llm As Judge Web Browsing

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation.
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang · Oct 21, 2025 · Citations: 0

Demonstrations Simulation Env Long Horizon

Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen · Mar 3, 2026 · Citations: 0

Red Team Automatic Metrics Web Browsing

Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs.
TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026 · Citations: 0

Demonstrations Web Browsing

The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar · Feb 3, 2026 · Citations: 0

Automatic Metrics Web Browsing

To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts.
A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026 · Citations: 0

Automatic Metrics Tool Use

To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights.
Replaying pre-training data improves fine-tuning
Suhas Kotha, Percy Liang · Mar 5, 2026 · Citations: 0

Automatic Metrics Web Browsing

We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by 4.5\% and Basque question-answering accuracy by 2\%.
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
UI-Venus-1.5 Technical Report
Venus Team, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu · Feb 9, 2026 · Citations: 0

Long Horizon

In this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world applications.
Visual Planning: Let's Think Only with Images
Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang · May 16, 2025 · Citations: 0

Web Browsing

In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions.

Related Hubs

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote