HFEPX Metric Hub

Inference Cost + Automatic Metrics: Metric Papers (Last 90 Days)

Updated from the current HFEPX corpus (Apr 9, 2026). 10 papers are grouped on this metric page. Common evaluation modes: Automatic Metrics, LLM-as-Judge. Most common rater population: domain experts. Common annotation unit: multi-dimensional rubric. Frequently cited benchmark: BrowseComp. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Apr 6, 2026.

Papers: 10 · Last published: Apr 6, 2026

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: Developing.

Metric Coverage

100.0%

10 sampled papers include metric names.

Benchmark Anchoring

40.0%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

0.0%

None of the 10 sampled papers report calibration, adjudication, or inter-annotator agreement (IAA) controls.

  • None of the 10 papers in this sample is flagged as low-signal.
  • Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Primary action: Treat this as directional signal only; metric reporting is present but benchmark anchoring is still thin.

Why This Matters (Expanded)

Why This Matters For Eval Research

  • 40% of papers report explicit human-feedback signals, led by critique/edit feedback.
  • Automatic metrics evaluation appears in 100% of papers in this hub.
  • BrowseComp is a recurring benchmark anchor for cross-paper comparisons in this page.

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Rater context is mostly domain experts, and annotation is commonly multi-dimensional rubrics; use this to scope replication staffing.
  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
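
For that calibration check, a minimal sketch, assuming you have matched judge and human ratings for the same items (the scores below are hypothetical, not drawn from any paper in this hub): a dependency-free Pearson correlation plus mean absolute error.

```python
# Hypothetical judge-vs-human scores on a 1-5 rubric; illustrative only,
# not taken from any paper in this hub.
from statistics import mean

human_scores = [4.0, 3.5, 2.0, 5.0, 3.0, 4.5]   # expert ratings
judge_scores = [4.2, 3.0, 2.5, 4.8, 3.4, 4.0]   # LLM-as-judge ratings on the same items

def pearson(xs, ys):
    """Plain Pearson correlation, no external dependencies."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

mae = mean(abs(h - j) for h, j in zip(human_scores, judge_scores))
print(f"correlation={pearson(human_scores, judge_scores):.3f}  MAE={mae:.3f}")
```

A low correlation or a large systematic offset is a signal to recalibrate the judge rubric before reusing its labels.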

Metric Interpretation

  • cost is reported in 100% of hub papers (10/10); compare with a secondary metric before ranking methods.
  • inference cost is reported in 100% of hub papers (10/10); compare with a secondary metric before ranking methods.
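
A minimal sketch of that secondary-metric check, using hypothetical method names and numbers: rank once by cost, once by a secondary metric, and flag any disagreement before publishing a single ordering.

```python
# Hypothetical methods and numbers, for illustration only: check whether a
# cost-based ranking is contradicted by a secondary metric (accuracy).
methods = {
    "method_a": {"cost": 1.2, "accuracy": 0.81},
    "method_b": {"cost": 0.9, "accuracy": 0.74},
    "method_c": {"cost": 1.5, "accuracy": 0.86},
}

by_cost = sorted(methods, key=lambda m: methods[m]["cost"])           # lower is better
by_accuracy = sorted(methods, key=lambda m: -methods[m]["accuracy"])  # higher is better

print("rank by cost:    ", by_cost)
print("rank by accuracy:", by_accuracy)
if by_cost != by_accuracy:
    print("Rankings disagree; report both metrics instead of a single ordering.")
```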

Benchmark Context

  • BrowseComp appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
  • GAIA appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

  • S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models (Apr 1, 2026)
    Metrics: Pass@1, Cost · Benchmarks: MATH 500, GSM8K · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution (Apr 1, 2026)
    Metrics: Cost, Inference cost · Benchmarks: YC-Bench · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE (Mar 31, 2026)
    Metrics: NDCG, Cost · Benchmarks: Not reported · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • QED-Nano: Teaching a Tiny Model to Prove Hard Theorems (Apr 6, 2026)
    Metrics: Cost, Inference cost · Benchmarks: Not reported · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning (Mar 9, 2026)
    Metrics: Accuracy, Cost · Benchmarks: MMLU · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization (Feb 26, 2026)
    Metrics: Accuracy, Latency · Benchmarks: GAIA, BrowseComp · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • CAMEL: Confidence-Gated Reflection for Reward Modeling (Feb 24, 2026)
    Metrics: Accuracy, Cost · Benchmarks: Not reported · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • Distilling Feedback into Memory-as-a-Tool (Jan 9, 2026)
    Metrics: Cost, Inference cost · Benchmarks: Not reported · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • Luna-2: Scalable Single-Token Evaluation with Small Language Models (Feb 20, 2026)
    Metrics: Accuracy, Latency · Benchmarks: Not reported · Eval Modes: LLM-as-Judge, Automatic Metrics · Quality Controls: Not reported

  • TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers (Feb 18, 2026)
    Metrics: Latency, Cost · Benchmarks: Not reported · Eval Modes: Automatic Metrics · Quality Controls: Not reported
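
To act on the comparability caveat from the triage section, a minimal sketch of filtering matrix-style records down to setup-compatible pairs before comparing metrics. The dict export format and field names are assumptions made for this sketch, not an interface the hub provides.

```python
# Entries mirror the protocol matrix above; field names are assumptions made
# for illustration, not an export format provided by the hub.
papers = [
    {"title": "S0 Tuning", "metrics": {"Pass@1", "Cost"},
     "benchmarks": {"MATH 500", "GSM8K"}, "eval_modes": {"Automatic Metrics"}},
    {"title": "Search More, Think Less", "metrics": {"Accuracy", "Latency"},
     "benchmarks": {"GAIA", "BrowseComp"}, "eval_modes": {"Automatic Metrics"}},
    {"title": "Luna-2", "metrics": {"Accuracy", "Latency"},
     "benchmarks": set(), "eval_modes": {"LLM-as-Judge", "Automatic Metrics"}},
]

def comparable(a, b):
    """Two entries are comparable only if they share at least one metric,
    one benchmark, and one evaluation mode."""
    return (a["metrics"] & b["metrics"]
            and a["benchmarks"] & b["benchmarks"]
            and a["eval_modes"] & b["eval_modes"])

pairs = [(a["title"], b["title"])
         for i, a in enumerate(papers) for b in papers[i + 1:]
         if comparable(a, b)]
print(pairs or "No setup-compatible pairs in this subset.")
```
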
Researcher Workflow (Detailed)

Checklist

  • Moderate: Papers with explicit human feedback

    Coverage is usable but incomplete (40% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (0% vs 30% target).

  • Strong: Papers naming benchmarks/datasets

    Coverage is strong (40% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (100% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (10% vs 35% target).

  • Strong: Papers with known annotation unit

    Coverage is strong (60% vs 35% target).

Strengths

  • Most papers provide measurable evaluation context (40% benchmarks, 100% metrics).
  • Agentic evaluation appears in 50% of papers.

Known Gaps

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (10% coverage).
  • LLM-as-judge appears without enough inter-annotator agreement reporting.
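
A minimal sketch of the missing agreement check, with hypothetical binary labels: Cohen's kappa between an LLM judge and a single human rater, computable without external dependencies.

```python
# Hypothetical binary labels (1 = pass, 0 = fail) for the same items from an
# LLM judge and a human rater; illustrates the IAA reporting this hub flags as missing.
from collections import Counter

judge = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
human = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

def cohen_kappa(a, b):
    """Observed vs chance agreement over two label sequences of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

print(f"Cohen's kappa = {cohen_kappa(judge, human):.3f}")
```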

Suggested Next Analyses

  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
  • Stratify by benchmark (BrowseComp vs GAIA) before comparing methods.
  • Track metric sensitivity by reporting both cost and inference cost.
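
For the metric-sensitivity item above, a minimal sketch (hypothetical per-method numbers) that quantifies how much the method ranking shifts between cost and inference cost, via a simple Kendall tau over the shared methods.

```python
# Hypothetical per-method numbers; checks whether rankings by "cost" and
# "inference cost" tell the same story (simple Kendall-tau concordance).
costs          = {"method_a": 1.20, "method_b": 0.90, "method_c": 1.50, "method_d": 1.10}
inference_cost = {"method_a": 0.30, "method_b": 0.42, "method_c": 0.55, "method_d": 0.28}

def kendall_tau(scores_a, scores_b):
    """Kendall tau over the keys of scores_a (assumes no ties)."""
    keys = sorted(scores_a)
    concordant = discordant = 0
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            da = scores_a[keys[i]] - scores_a[keys[j]]
            db = scores_b[keys[i]] - scores_b[keys[j]]
            if da * db > 0:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (concordant + discordant)

print(f"tau(cost, inference cost) = {kendall_tau(costs, inference_cost):.3f}")
```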

Known Limitations

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (10% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Top Metrics

  • Cost (10)
  • Inference cost (10)
  • Accuracy (4)
  • Latency (3)

Evaluation Modes

  • Automatic Metrics (10)
  • LLM-as-Judge (1)

Top Benchmarks

  • BrowseComp (1)
  • GAIA (1)
  • GSM8K (1)
  • HumanEval+ (1)

Agentic Mix

  • Long Horizon (5)

Top Papers Reporting This Metric

  • Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE

    Hejin Huang, Jusheng Zhang, Kaitong Cai, Jian Wang, Rong Pan · Mar 31, 2026 · Citations: 0

    Automatic Metrics · General

    Preference-based alignment objectives have been widely adopted, from RLHF-style pairwise learning in large language models to emerging applications in recommender systems.

  • QED-Nano: Teaching a Tiny Model to Prove Hard Theorems

    LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching · Apr 6, 2026 · Citations: 0

    Automatic Metrics · MathCoding

    To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.

  • CAMEL: Confidence-Gated Reflection for Reward Modeling

    Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar · Feb 24, 2026 · Citations: 0

    Automatic Metrics · General

    Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances.

  • Distilling Feedback into Memory-as-a-Tool

    Víctor Gallego · Jan 9, 2026 · Citations: 0

    Automatic Metrics · General

    We propose a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, through a file-based memory system and agent-controlled tool calls.

  • S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

    Jack Young · Apr 1, 2026 · Citations: 0

    Automatic Metrics · MathCoding

    Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval.

  • Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning

    Juming Xiong, Kevin Guo, Congning Ni, Chao Yan, Katherine Brown · Mar 9, 2026 · Citations: 0

    Automatic Metrics · Math

    Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead.

  • YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

    Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi · Apr 1, 2026 · Citations: 0

    Automatic Metrics · General

    As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound.

  • Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

    Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu · Feb 26, 2026 · Citations: 0

    Automatic Metrics · General

    Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.

  • Luna-2: Scalable Single-Token Evaluation with Small Language Models

    Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026 · Citations: 0

    LLM-as-Judge, Automatic Metrics · General

    We present Luna-2, a novel architecture that leverages decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g.

  • TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers

    Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif · Feb 18, 2026 · Citations: 0

    Automatic Metrics · General

    We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces.
