HFEPX Hub

Tool Use + Automatic Metrics (Last 90 Days)

Updated from the current HFEPX corpus (Mar 10, 2026). This hub groups 11 papers. Common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Multi-Dim Rubric. Frequently cited benchmark: MMLU. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Mar 9, 2026.

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage

100.0%

All 11 sampled papers (11/11) are free of low-signal flags.

Replication-Ready Set

3

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

0

Papers containing both `human_eval` and `llm_as_judge`.

  • 3 papers are replication-ready (benchmark + metric + explicit evaluation mode).
  • 0 papers support judge-vs-human agreement analysis.
  • 0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

The ranking and matrix sections below currently show only the 3 replication-ready papers.
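
A minimal sketch of the replication-ready filter described above, assuming a simple per-paper metadata record (the field names are illustrative, not the hub's actual schema): a paper qualifies only when benchmark, metric, and evaluation mode are all explicitly present.

```python
from dataclasses import dataclass, field

@dataclass
class PaperRecord:
    """Abstract-level protocol metadata for one paper (hypothetical schema)."""
    title: str
    benchmarks: list = field(default_factory=list)
    metrics: list = field(default_factory=list)
    eval_modes: list = field(default_factory=list)

def is_replication_ready(paper: PaperRecord) -> bool:
    # All three protocol anchors must be explicitly present (non-empty).
    return bool(paper.benchmarks and paper.metrics and paper.eval_modes)

papers = [
    PaperRecord("$OneMillion-Bench", ["$OneMillion-Bench"],
                ["accuracy", "coherence"], ["automatic_metrics"]),
    # Benchmark field empty in the hub metadata, so excluded from the set.
    PaperRecord("A Benchmark for Deep Information Synthesis", [],
                ["f1"], ["automatic_metrics"]),
]
print([p.title for p in papers if is_replication_ready(p)])
# -> ['$OneMillion-Bench']
```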

Why This Matters For Eval Research

  • 27.3% of papers report explicit human-feedback signals, split evenly across pairwise preferences, red-teaming, and rubric ratings.
  • Automatic metrics appear in 100% of papers in this hub.
  • MMLU is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Where reported, raters are domain experts and annotation uses multi-dimensional rubrics; use this to scope replication staffing.
  • Stratify by benchmark (MMLU vs $OneMillion-Bench) before comparing methods.

Benchmark Interpretation

  • MMLU appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.
  • $OneMillion-Bench appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 36.4% of hub papers (4/11); compare with a secondary metric before ranking methods.
  • cost is reported in 36.4% of hub papers (4/11); compare with a secondary metric before ranking methods.

Researcher Checklist

  • Moderate: Papers with explicit human feedback

    Coverage is usable but incomplete (27.3% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (0% vs 30% target).

  • Moderate: Papers naming benchmarks/datasets

    Coverage is usable but incomplete (27.3% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (81.8% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (9.1% vs 35% target).

  • Moderate: Papers with known annotation unit

    Coverage is usable but incomplete (27.3% vs 35% target).
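
The Strong/Moderate/Gap labels above can be reproduced with a simple banding rule. A hedged sketch, assuming coverage at or above target is "Strong" and coverage at or above half the target is "Moderate" (the half-target cutoff is an assumption that happens to match all six labels shown here; the hub's real banding rule is not documented):

```python
def coverage_band(coverage: float, target: float) -> str:
    """Classify checklist coverage against its target (assumed rule)."""
    if coverage >= target:
        return "Strong"
    if coverage >= target / 2:  # assumed half-target cutoff for "Moderate"
        return "Moderate"
    return "Gap"

# The six checklist rows from this hub (coverage %, target %).
checks = [
    ("explicit human feedback", 27.3, 45.0),
    ("quality controls",         0.0, 30.0),
    ("benchmarks/datasets",     27.3, 35.0),
    ("evaluation metrics",      81.8, 35.0),
    ("rater population",         9.1, 35.0),
    ("annotation unit",         27.3, 35.0),
]
for name, cov, tgt in checks:
    print(f"{coverage_band(cov, tgt):8s} {name}: {cov}% vs {tgt}% target")
```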

Strengths

  • Agentic evaluation appears in 100% of papers.

Known Gaps

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (9.1% coverage).

Suggested Next Analyses

  • Stratify by benchmark (MMLU vs $OneMillion-Bench) before comparing methods; a sketch follows this list.
  • Track metric sensitivity by reporting both accuracy and cost.
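
A minimal sketch of that stratification, using illustrative records rather than data from these papers: group results by benchmark first, then compare methods within each stratum on both accuracy and cost.

```python
from collections import defaultdict

# Hypothetical per-run results; none of these numbers come from the hub.
results = [
    {"benchmark": "MMLU",              "method": "A", "accuracy": 0.71, "cost": 1.00},
    {"benchmark": "MMLU",              "method": "B", "accuracy": 0.69, "cost": 0.60},
    {"benchmark": "$OneMillion-Bench", "method": "A", "accuracy": 0.42, "cost": 3.10},
]

by_benchmark = defaultdict(list)
for r in results:
    by_benchmark[r["benchmark"]].append(r)

for bench, rows in by_benchmark.items():
    print(bench)
    # Rank within the stratum only; never pool scores across benchmarks.
    for r in sorted(rows, key=lambda x: -x["accuracy"]):
        print(f"  method {r['method']}: accuracy={r['accuracy']:.2f}, cost={r['cost']:.2f}")
```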

Recommended Queries

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal           | $OneMillion-Bench   | Confidence-Driven Model Selection | Zooming without Zooming
Human Feedback   | Rubric Rating       | Not reported                      | Not reported
Evaluation Modes | Automatic Metrics   | Automatic Metrics                 | Automatic Metrics
Benchmarks       | $OneMillion-Bench   | MMLU                              | Zoombench
Metrics          | Accuracy, Coherence | Accuracy, Cost                    | Latency
Quality Controls | Not reported        | Not reported                      | Not reported
Rater Population | Domain Experts      | Unknown                           | Unknown
Annotation Unit  | Multi Dim Rubric    | Unknown                           | Unknown

(Column headers abbreviate the full paper titles listed below.)
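
A hedged sketch of how rows like the diff above can be rendered from per-paper protocol records; the field names mirror the table rows, but the record structure is an assumption, not the hub's actual schema.

```python
# Rows of the protocol diff, in display order.
FIELDS = ["Human Feedback", "Evaluation Modes", "Benchmarks", "Metrics",
          "Quality Controls", "Rater Population", "Annotation Unit"]

# Hypothetical records; any field missing from a record prints "Not reported".
papers = {
    "$OneMillion-Bench": {
        "Human Feedback": "Rubric Rating",
        "Evaluation Modes": "Automatic Metrics",
        "Benchmarks": "$OneMillion-Bench",
        "Metrics": "Accuracy, Coherence",
        "Rater Population": "Domain Experts",
        "Annotation Unit": "Multi Dim Rubric",
    },
    "Confidence-Driven Model Selection": {
        "Evaluation Modes": "Automatic Metrics",
        "Benchmarks": "MMLU",
        "Metrics": "Accuracy, Cost",
    },
}

for field_name in FIELDS:
    cells = [papers[title].get(field_name, "Not reported") for title in papers]
    print(f"{field_name:17s} | " + " | ".join(f"{c:20s}" for c in cells))
```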

Suggested Reading Order

This is the expanded reading list; use "Start Here" above for a faster pass.

  1. $OneMillion-Bench: How Far are Language Agents from Human Experts?

    Start here for detailed protocol reporting. Signals: automatic metrics + rubric ratings. Focus: $OneMillion-Bench / accuracy. Abstract: We adopt a rubric-based evaluation protocol scoring factual…

  2. Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference

    Start here for detailed protocol reporting. Signals: automatic metrics. Focus: MMLU / accuracy. Abstract: Large Language Models (LLMs) have revolutionized inference across diverse natural language…

  3. A Benchmark for Deep Information Synthesis

    Start here for detailed protocol reporting. Signals: automatic metrics. Focus: f1. Abstract: When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a…

  4. What Matters For Safety Alignment?

    Include a human-eval paper to calibrate against judge-based evaluation settings. Signals: automatic metrics + red-team protocols. Focus: success rate. Abstract: This paper presents a comprehensive empirical study on…

  5. Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills

    Adds automatic metrics with pairwise preferences for broader protocol coverage within this hub. Signals: automatic metrics + pairwise preferences. Focus: cost. Abstract: On the agent side, A1 (tool-execution-signaled)…

  6. Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

    Adds automatic metrics for broader protocol coverage within this hub. Signals: automatic metrics. Focus: Zoombench / latency. Abstract: Multimodal Large Language Models (MLLMs) excel at broad visual understanding…

  7. REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents

    Adds automatic metrics for broader protocol coverage within this hub. Signals: automatic metrics. Focus: recall. Abstract: Large language models are transitioning from general-purpose knowledge engines to real-world problem…

  8. EnsembleLink: Accurate Record Linkage Without Training Data

    Adds automatic metrics for broader protocol coverage within this hub. Signals: automatic metrics. Focus: accuracy. Abstract: Record linkage, the process of matching records that refer to the same…

Known Limitations

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (9.1% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot

Human Feedback Mix

  • Pairwise Preference (1)
  • Red Team (1)
  • Rubric Rating (1)

Evaluation Modes

  • Automatic Metrics (11)

Top Benchmarks

  • MMLU (1)
  • $OneMillion-Bench (1)
  • Zoombench (1)

Top Metrics

  • Accuracy (4)
  • Cost (4)
  • Latency (2)
  • Coherence (1)

Rater Population Mix

  • Domain Experts (1)

Quality Controls

  • None reported (0/11)

Coverage diagnostics (sample-based): human-feedback 27.3% · benchmarks 27.3% · metrics 81.8% · quality controls 0.0%.

Top Papers

  • $OneMillion-Bench: How Far are Language Agents from Human Experts?

    Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen · Mar 9, 2026 · Citations: 0

    Rubric Rating · Automatic Metrics · Tool Use

    To this end, we introduce $OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios.

  • Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference

    Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026 · Citations: 0

    Automatic Metrics · Tool Use

    Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%.

  • Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

    Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai · Feb 12, 2026 · Citations: 0

    Automatic Metrics · Tool Use

    To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.
