HFEPX Hub

Tool Use + Automatic Metrics (Last 90 Days)

Updated from current HFEPX corpus (Mar 10, 2026). 11 papers are grouped in this hub page.

Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Multi Dim Rubric. Frequently cited benchmark: MMLU. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 9, 2026.

Papers: 11 · Last published: Mar 9, 2026

Tool Use · Automatic Metrics · Last 90d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage

100.0%

11 / 11 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

3 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Why This Matters For Eval Research

27.3% of papers (3/11) report explicit human-feedback signals, split evenly across pairwise preference, red teaming, and rubric rating (one paper each).
Automatic metrics appear in 100% of papers in this hub.
MMLU appears in one paper (1/11), so it is a single anchor for cross-paper comparison rather than a recurring benchmark.

Protocol Takeaways

No paper in this slice reports quality controls; when extending the set, prioritize papers with explicit calibration or adjudication steps.
Rater population is reported in only one paper (domain experts, annotating with a multi-dimensional rubric); treat it as a single data point when scoping replication staffing.
Stratify by benchmark (MMLU vs Onemillion-Bench) before comparing methods.

Benchmark Interpretation

MMLU appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.
Onemillion-Bench appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 36.4% of hub papers (4/11); compare with a secondary metric before ranking methods.
Cost is reported in 36.4% of hub papers (4/11); compare with a secondary metric before ranking methods.

Researcher Checklist

Moderate: Papers with explicit human feedback
Coverage is usable but incomplete (27.3% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (0% vs 30% target).

Moderate: Papers naming benchmarks/datasets
Coverage is usable but incomplete (27.3% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (81.8% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (9.1% vs 35% target).

Moderate: Papers with known annotation unit
Coverage is usable but incomplete (27.3% vs 35% target).
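
The checklist figures above can be reproduced from paper counts. The sketch below is illustrative only: the page does not state how the Strong / Moderate / Gap labels are assigned, so the cut-offs in `band` (at or above target, at least half of target, otherwise Gap) are an assumption chosen because it matches the six rows shown. The counts are inferred from the listed percentages over 11 papers, and the function names are hypothetical.

```python
TOTAL_PAPERS = 11  # hub size stated on this page


def coverage_pct(count: int, total: int = TOTAL_PAPERS) -> float:
    """Share of papers, as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


def band(coverage: float, target: float) -> str:
    """ASSUMED banding rule; the page does not document its thresholds."""
    if coverage >= target:
        return "Strong"
    if coverage >= 0.5 * target:
        return "Moderate"
    return "Gap"


# (checklist item, papers counted, target %) ; counts inferred from the percentages
CHECKLIST = [
    ("explicit human feedback", 3, 45.0),
    ("quality controls", 0, 30.0),
    ("benchmarks/datasets named", 3, 35.0),
    ("evaluation metrics named", 9, 35.0),
    ("known rater population", 1, 35.0),
    ("known annotation unit", 3, 35.0),
]

REPORT = [
    (item, coverage_pct(count), target, band(coverage_pct(count), target))
    for item, count, target in CHECKLIST
]

if __name__ == "__main__":
    for item, cov, target, label in REPORT:
        print(f"{label}: {item} ({cov}% vs {target:g}% target)")
```

If the real banding rule differs, only `band` needs to change; the coverage arithmetic is independent of it.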

Strengths

Agentic evaluation appears in 100% of papers.

Known Gaps

No papers report quality controls (0%); prioritize calibration/adjudication evidence.
Rater population is under-specified (9.1% coverage).

Suggested Next Analyses

Stratify by benchmark (MMLU vs Onemillion-Bench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
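
As a rough illustration of these two suggestions, the sketch below groups papers by benchmark and flags which ones report both accuracy and cost. The rows are a subset copied from this hub's protocol matrix with shortened titles; the helper names `stratify_by_benchmark` and `reports_both` are hypothetical, and no result values are included because the page does not list any.

```python
from collections import defaultdict

# Subset of rows from the hub's protocol matrix (titles shortened for readability).
PAPERS = [
    {"title": "$OneMillion-Bench", "benchmark": "Onemillion Bench",
     "metrics": ["Accuracy", "Coherence"]},
    {"title": "Confidence-Driven Multi-Scale Model Selection", "benchmark": "MMLU",
     "metrics": ["Accuracy", "Cost"]},
    {"title": "Zooming without Zooming", "benchmark": "Zoombench",
     "metrics": ["Latency"]},
    {"title": "REDSearcher", "benchmark": None,
     "metrics": ["Recall", "Cost"]},
]


def stratify_by_benchmark(papers):
    """Group paper titles by benchmark so methods are compared only within a stratum."""
    strata = defaultdict(list)
    for paper in papers:
        key = paper["benchmark"] or "Not Reported"
        strata[key].append(paper["title"])
    return dict(strata)


def reports_both(paper, first="Accuracy", second="Cost"):
    """True when a paper reports both metrics, i.e. supports a sensitivity check."""
    return first in paper["metrics"] and second in paper["metrics"]


STRATA = stratify_by_benchmark(PAPERS)
BOTH = [p["title"] for p in PAPERS if reports_both(p)]

if __name__ == "__main__":
    for benchmark, titles in STRATA.items():
        print(benchmark, "->", titles)
    print("Accuracy and cost both reported:", BOTH)
```

With the current hub contents each named benchmark holds a single paper, so the strata mainly show that no two papers are yet directly comparable on a shared benchmark.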

Recommended Queries

Benchmark Slice: MMLU · Metric Slice: accuracy · Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

$OneMillion-Bench: How Far are Language Agents from Human Experts?

Highest protocol score with explicit human/eval signal plus Onemillion-Bench.

Strongest benchmark reference

Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Infe…

MMLU with accuracy gives a fast comparison anchor.

Strongest recent paper

Zooming without Zooming: Region-to-Image Distillation for Fine-Graine…

Useful for current practice scanning; published Feb 12, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

$OneMillion-Bench: How Far are Language Agents from Human Experts?
Mar 9, 2026 · Citations: 0 · Score: 8.0
HF: Rubric Rating · Eval: Automatic Metrics · Benchmark: Onemillion Bench · Metric: Accuracy

Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Feb 25, 2026 · Citations: 0 · Score: 6.0
HF: Not reported · Eval: Automatic Metrics · Benchmark: MMLU · Metric: Accuracy

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Feb 12, 2026 · Citations: 0 · Score: 6.0
HF: Not reported · Eval: Automatic Metrics · Benchmark: Zoombench · Metric: Latency

What Matters For Safety Alignment?
Jan 7, 2026 · Citations: 0 · Score: 5.5
HF: Red Team · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Success rate

Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills
Dec 18, 2025 · Citations: 0 · Score: 5.5
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Cost

REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents
Feb 15, 2026 · Citations: 0 · Score: 4.0
HF: Not reported · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Recall

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	Date	HF Signal	Eval Modes	Benchmarks	Metrics	QC
$OneMillion-Bench: How Far are Language Agents from Human Experts?	Mar 9, 2026	Yes (Rubric Rating)	Automatic Metrics	Onemillion Bench	Accuracy, Coherence	Not Reported
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference	Feb 25, 2026	No (Not Reported)	Automatic Metrics	MMLU	Accuracy, Cost	Not Reported
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception	Feb 12, 2026	No (Not Reported)	Automatic Metrics	Zoombench	Latency	Not Reported
What Matters For Safety Alignment?	Jan 7, 2026	Yes (Red Team)	Automatic Metrics	Not Reported	Success rate, Jailbreak success rate	Not Reported
Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills	Dec 18, 2025	Yes (Pairwise Preference)	Automatic Metrics	Not Reported	Cost	Not Reported
REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents	Feb 15, 2026	No (Not Reported)	Automatic Metrics	Not Reported	Recall, Cost	Not Reported
A Benchmark for Deep Information Synthesis	Feb 24, 2026	No (Not Reported)	Automatic Metrics	Not Reported	F1	Not Reported
EnsembleLink: Accurate Record Linkage Without Training Data	Jan 29, 2026	No (Not Reported)	Automatic Metrics	Not Reported	Accuracy	Not Reported
When Do Tools and Planning Help Large Language Models Think? A Cost- and Latency-Aware Benchmark	Jan 6, 2026	No (Not Reported)	Automatic Metrics	Not Reported	Accuracy, Latency	Not Reported
PyVision-RL: Forging Open Agentic Vision Models via RL	Feb 24, 2026	No (Not Reported)	Automatic Metrics	Not Reported	Not Reported	Not Reported
STAR: Similarity-guided Teacher-Assisted Refinement for Super-Tiny Function Calling Models	Feb 3, 2026	No (Not Reported)	Automatic Metrics	Not Reported	Not Reported	Not Reported
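
The replication-ready criterion used on this page (benchmark + metric + explicit evaluation mode) can be applied directly to the matrix rows above. The following is a minimal sketch, not the hub's actual implementation: `ProtocolRow` and `is_replication_ready` are hypothetical names, titles are shortened, and an empty tuple stands in for "Not Reported".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtocolRow:
    title: str
    eval_modes: tuple   # empty tuple means "Not Reported"
    benchmarks: tuple
    metrics: tuple


AUTO = ("Automatic Metrics",)

# Rows transcribed from the protocol matrix (titles shortened).
ROWS = [
    ProtocolRow("$OneMillion-Bench", AUTO, ("Onemillion Bench",), ("Accuracy", "Coherence")),
    ProtocolRow("Confidence-Driven Multi-Scale Model Selection", AUTO, ("MMLU",), ("Accuracy", "Cost")),
    ProtocolRow("Zooming without Zooming", AUTO, ("Zoombench",), ("Latency",)),
    ProtocolRow("What Matters For Safety Alignment?", AUTO, (), ("Success rate", "Jailbreak success rate")),
    ProtocolRow("Adaptation of Agentic AI", AUTO, (), ("Cost",)),
    ProtocolRow("REDSearcher", AUTO, (), ("Recall", "Cost")),
    ProtocolRow("A Benchmark for Deep Information Synthesis", AUTO, (), ("F1",)),
    ProtocolRow("EnsembleLink", AUTO, (), ("Accuracy",)),
    ProtocolRow("When Do Tools and Planning Help Large Language Models Think?", AUTO, (), ("Accuracy", "Latency")),
    ProtocolRow("PyVision-RL", AUTO, (), ()),
    ProtocolRow("STAR", AUTO, (), ()),
]


def is_replication_ready(row: ProtocolRow) -> bool:
    """A row qualifies when benchmark, metric, and evaluation mode are all present."""
    return bool(row.eval_modes and row.benchmarks and row.metrics)


READY = [row.title for row in ROWS if is_replication_ready(row)]

if __name__ == "__main__":
    print(f"{len(READY)} of {len(ROWS)} papers are replication-ready:")
    for title in READY:
        print(" -", title)
```

Applied to these rows, the filter selects the same three papers that the page counts as replication-ready.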

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	$OneMillion-Bench: How Far are Language Agents fro…	Confidence-Driven Multi-Scale Model Selection for C…	Zooming without Zooming: Region-to-Image Distillati…
Human Feedback	Rubric Rating	Not reported	Not reported
Evaluation Modes	Automatic Metrics	Automatic Metrics	Automatic Metrics
Benchmarks	Onemillion Bench	MMLU	Zoombench
Metrics	Accuracy, Coherence	Accuracy, Cost	Latency
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Domain Experts	Unknown	Unknown
Annotation Unit	Multi Dim Rubric	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (1)
Red Team (1)
Rubric Rating (1)

Evaluation Modes

Automatic Metrics (11)

Top Benchmarks

MMLU (1)
Onemillion Bench (1)
Zoombench (1)

Top Metrics

Accuracy (4)
Cost (4)
Latency (2)
Coherence (1)

Rater Population Mix

Domain Experts (1)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 27.3% · benchmarks 27.3% · metrics 81.8% · quality controls 0.0%.

Top Papers

$OneMillion-Bench: How Far are Language Agents from Human Experts?
Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen · Mar 9, 2026 · Citations: 0
Rubric Rating · Automatic Metrics · Tool Use
To this end, we introduce $OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios.

What Matters For Safety Alignment?
Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong · Jan 7, 2026 · Citations: 0
Red Team · Automatic Metrics · Tool Use
This paper presents a comprehensive empirical study on the safety alignment capabilities.

Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills
Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He · Dec 18, 2025 · Citations: 0
Pairwise Preference · Automatic Metrics · Tool Use
Large language model (LLM) agents are moving beyond prompting alone.

Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026 · Citations: 0
Automatic Metrics · Tool Use
Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%.

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai · Feb 12, 2026 · Citations: 0
Automatic Metrics · Tool Use
To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.

REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents
Zheng Chu, Xiao Wang, Jack Hong, Huiming Fan, Yuqi Huang · Feb 15, 2026 · Citations: 0
Automatic Metrics · Tool Use
To address these challenges, we propose REDSearcher, a unified framework that co-designs complex task synthesis, mid-training, and post-training for scalable search-agent optimization.

A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026 · Citations: 0
Automatic Metrics · Tool Use
To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights.

EnsembleLink: Accurate Record Linkage Without Training Data
Noah Dasanaike · Jan 29, 2026 · Citations: 0
Automatic Metrics · Tool Use
On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling.

When Do Tools and Planning Help Large Language Models Think? A Cost- and Latency-Aware Benchmark
Subha Ghoshal, Ali Al-Bustami · Jan 6, 2026 · Citations: 0
Automatic Metrics · Tool Use
We benchmark this behavior on two real-world settings: event-centric question answering over graph-structured knowledge (Event-QA) and persuasive response generation in Reddit ChangeMyView (CMV).

PyVision-RL: Forging Open Agentic Vision Models via RL
Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng · Feb 24, 2026 · Citations: 0
Automatic Metrics · Tool Use
Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior.

STAR: Similarity-guided Teacher-Assisted Refinement for Super-Tiny Function Calling Models
Jiliang Ni, Jiachen Pu, Zhongyi Yang, Jingfeng Luo, Conggang Hu · Feb 3, 2026 · Citations: 0
Automatic Metrics · Tool Use
The proliferation of Large Language Models (LLMs) in function calling is pivotal for creating advanced AI agents, yet their large scale hinders widespread adoption, necessitating transferring their capabilities into smaller ones.
