HFEPX Metric Hub

Inference Cost + Automatic Metrics: Metric Papers (Last 90 Days)

Updated from the current HFEPX corpus (Apr 9, 2026). 10 papers are grouped on this metric page. Common evaluation modes: Automatic Metrics, LLM-as-Judge. Most common rater population: domain experts. Common annotation unit: multi-dimensional rubric. Frequently cited benchmark: BrowseComp. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Apr 6, 2026.

Papers: 10 · Last published: Apr 6, 2026

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: Developing.

Metric Coverage

100.0%

10 sampled papers include metric names.

Benchmark Anchoring

40.0%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

0.0%

None of the 10 sampled papers report calibration, adjudication, or inter-annotator agreement (IAA) controls.

  • None of the 10 papers in this sample is flagged as low-signal.
  • Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Primary action: Treat this as directional signal only; metric reporting is present but benchmark anchoring is still thin.

Why This Matters (Expanded)

Why This Matters For Eval Research

  • 40% of papers report explicit human-feedback signals, led by critique/edit feedback.
  • Automatic metrics evaluation appears in 100% of papers in this hub.
  • BrowseComp is a recurring benchmark anchor for cross-paper comparisons in this page.

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Rater context is mostly domain experts, and annotation is commonly multi-dimensional rubrics; use this to scope replication staffing.
  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
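
For that calibration check, a minimal sketch, assuming you have matched judge and human ratings for the same items (the scores below are hypothetical, not drawn from any paper in this hub): a dependency-free Pearson correlation plus mean absolute error.

```python
# Hypothetical judge-vs-human scores on a 1-5 rubric; illustrative only,
# not taken from any paper in this hub.
from statistics import mean

human_scores = [4.0, 3.5, 2.0, 5.0, 3.0, 4.5]   # expert ratings
judge_scores = [4.2, 3.0, 2.5, 4.8, 3.4, 4.0]   # LLM-as-judge ratings on the same items

def pearson(xs, ys):
    """Plain Pearson correlation, no external dependencies."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

mae = mean(abs(h - j) for h, j in zip(human_scores, judge_scores))
print(f"correlation={pearson(human_scores, judge_scores):.3f}  MAE={mae:.3f}")
```

A low correlation or a large systematic offset is a signal to recalibrate the judge rubric before reusing its labels.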

Metric Interpretation

  • cost is reported in 100% of hub papers (10/10); compare with a secondary metric before ranking methods.
  • inference cost is reported in 100% of hub papers (10/10); compare with a secondary metric before ranking methods.
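
A minimal sketch of that secondary-metric check, using hypothetical method names and numbers: rank once by cost, once by a secondary metric, and flag any disagreement before publishing a single ordering.

```python
# Hypothetical methods and numbers, for illustration only: check whether a
# cost-based ranking is contradicted by a secondary metric (accuracy).
methods = {
    "method_a": {"cost": 1.2, "accuracy": 0.81},
    "method_b": {"cost": 0.9, "accuracy": 0.74},
    "method_c": {"cost": 1.5, "accuracy": 0.86},
}

by_cost = sorted(methods, key=lambda m: methods[m]["cost"])           # lower is better
by_accuracy = sorted(methods, key=lambda m: -methods[m]["accuracy"])  # higher is better

print("rank by cost:    ", by_cost)
print("rank by accuracy:", by_accuracy)
if by_cost != by_accuracy:
    print("Rankings disagree; report both metrics instead of a single ordering.")
```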

Benchmark Context

  • BrowseComp appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
  • GAIA appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

  • S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models (Apr 1, 2026)
    Metrics: Pass@1, Cost · Benchmarks: MATH 500, GSM8K · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution (Apr 1, 2026)
    Metrics: Cost, Inference cost · Benchmarks: YC-Bench · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE (Mar 31, 2026)
    Metrics: NDCG, Cost · Benchmarks: Not reported · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • QED-Nano: Teaching a Tiny Model to Prove Hard Theorems (Apr 6, 2026)
    Metrics: Cost, Inference cost · Benchmarks: Not reported · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning (Mar 9, 2026)
    Metrics: Accuracy, Cost · Benchmarks: MMLU · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization (Feb 26, 2026)
    Metrics: Accuracy, Latency · Benchmarks: GAIA, BrowseComp · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • CAMEL: Confidence-Gated Reflection for Reward Modeling (Feb 24, 2026)
    Metrics: Accuracy, Cost · Benchmarks: Not reported · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • Distilling Feedback into Memory-as-a-Tool (Jan 9, 2026)
    Metrics: Cost, Inference cost · Benchmarks: Not reported · Eval Modes: Automatic Metrics · Quality Controls: Not reported

  • Luna-2: Scalable Single-Token Evaluation with Small Language Models (Feb 20, 2026)
    Metrics: Accuracy, Latency · Benchmarks: Not reported · Eval Modes: LLM-as-Judge, Automatic Metrics · Quality Controls: Not reported

  • TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers (Feb 18, 2026)
    Metrics: Latency, Cost · Benchmarks: Not reported · Eval Modes: Automatic Metrics · Quality Controls: Not reported
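
To act on the comparability caveat from the triage section, a minimal sketch of filtering matrix-style records down to setup-compatible pairs before comparing metrics. The dict export format and field names are assumptions made for this sketch, not an interface the hub provides.

```python
# Entries mirror the protocol matrix above; field names are assumptions made
# for illustration, not an export format provided by the hub.
papers = [
    {"title": "S0 Tuning", "metrics": {"Pass@1", "Cost"},
     "benchmarks": {"MATH 500", "GSM8K"}, "eval_modes": {"Automatic Metrics"}},
    {"title": "Search More, Think Less", "metrics": {"Accuracy", "Latency"},
     "benchmarks": {"GAIA", "BrowseComp"}, "eval_modes": {"Automatic Metrics"}},
    {"title": "Luna-2", "metrics": {"Accuracy", "Latency"},
     "benchmarks": set(), "eval_modes": {"LLM-as-Judge", "Automatic Metrics"}},
]

def comparable(a, b):
    """Two entries are comparable only if they share at least one metric,
    one benchmark, and one evaluation mode."""
    return (a["metrics"] & b["metrics"]
            and a["benchmarks"] & b["benchmarks"]
            and a["eval_modes"] & b["eval_modes"])

pairs = [(a["title"], b["title"])
         for i, a in enumerate(papers) for b in papers[i + 1:]
         if comparable(a, b)]
print(pairs or "No setup-compatible pairs in this subset.")
```
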
Researcher Workflow (Detailed)

Checklist

  • Moderate: Papers with explicit human feedback

    Coverage is usable but incomplete (40% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (0% vs 30% target).

  • Strong: Papers naming benchmarks/datasets

    Coverage is strong (40% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (100% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (10% vs 35% target).

  • Strong: Papers with known annotation unit

    Coverage is strong (60% vs 35% target).

Strengths

  • Most papers provide measurable evaluation context (40% benchmarks, 100% metrics).
  • Agentic evaluation appears in 50% of papers.

Known Gaps

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (10% coverage).
  • LLM-as-judge appears without enough inter-annotator agreement reporting.
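
A minimal sketch of the missing agreement check, with hypothetical binary labels: Cohen's kappa between an LLM judge and a single human rater, computable without external dependencies.

```python
# Hypothetical binary labels (1 = pass, 0 = fail) for the same items from an
# LLM judge and a human rater; illustrates the IAA reporting this hub flags as missing.
from collections import Counter

judge = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
human = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

def cohen_kappa(a, b):
    """Observed vs chance agreement over two label sequences of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

print(f"Cohen's kappa = {cohen_kappa(judge, human):.3f}")
```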

Suggested Next Analyses

  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
  • Stratify by benchmark (BrowseComp vs GAIA) before comparing methods.
  • Track metric sensitivity by reporting both cost and inference cost.
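
For the metric-sensitivity item above, a minimal sketch (hypothetical per-method numbers) that quantifies how much the method ranking shifts between cost and inference cost, via a simple Kendall tau over the shared methods.

```python
# Hypothetical per-method numbers; checks whether rankings by "cost" and
# "inference cost" tell the same story (simple Kendall-tau concordance).
costs          = {"method_a": 1.20, "method_b": 0.90, "method_c": 1.50, "method_d": 1.10}
inference_cost = {"method_a": 0.30, "method_b": 0.42, "method_c": 0.55, "method_d": 0.28}

def kendall_tau(scores_a, scores_b):
    """Kendall tau over the keys of scores_a (assumes no ties)."""
    keys = sorted(scores_a)
    concordant = discordant = 0
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            da = scores_a[keys[i]] - scores_a[keys[j]]
            db = scores_b[keys[i]] - scores_b[keys[j]]
            if da * db > 0:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (concordant + discordant)

print(f"tau(cost, inference cost) = {kendall_tau(costs, inference_cost):.3f}")
```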

Known Limitations

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (10% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Top Metrics

  • Cost (10)
  • Inference cost (10)
  • Accuracy (4)
  • Latency (3)

Evaluation Modes

  • Automatic Metrics (10)
  • LLM-as-Judge (1)

Top Benchmarks

  • BrowseComp (1)
  • GAIA (1)
  • GSM8K (1)
  • HumanEval+ (1)

Agentic Mix

  • Long Horizon (5)

Top Papers Reporting This Metric

  • Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE

    Hejin Huang, Jusheng Zhang, Kaitong Cai, Jian Wang, Rong Pan · Mar 31, 2026 · Citations: 0

    Automatic Metrics · General

    Preference-based alignment objectives have been widely adopted, from RLHF-style pairwise learning in large language models to emerging applications in recommender systems.

  • QED-Nano: Teaching a Tiny Model to Prove Hard Theorems

    LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching · Apr 6, 2026 · Citations: 0

    Automatic Metrics · MathCoding

    To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.

  • CAMEL: Confidence-Gated Reflection for Reward Modeling

    Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar · Feb 24, 2026 · Citations: 0

    Automatic Metrics · General

    Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances.

  • Distilling Feedback into Memory-as-a-Tool

    Víctor Gallego · Jan 9, 2026 · Citations: 0

    Automatic Metrics · General

    We propose a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, through a file-based memory system and agent-controlled tool calls.

  • S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

    Jack Young · Apr 1, 2026 · Citations: 0

    Automatic Metrics · MathCoding

    Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval.

  • Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning

    Juming Xiong, Kevin Guo, Congning Ni, Chao Yan, Katherine Brown · Mar 9, 2026 · Citations: 0

    Automatic Metrics · Math

    Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead.

  • YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

    Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi · Apr 1, 2026 · Citations: 0

    Automatic Metrics · General

    As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound.

  • Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

    Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu · Feb 26, 2026 · Citations: 0

    Automatic Metrics · General

    Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.

  • Luna-2: Scalable Single-Token Evaluation with Small Language Models

    Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026 · Citations: 0

    LLM-as-Judge, Automatic Metrics · General

    We present Luna-2, a novel architecture that leverages decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g.

  • TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers

    Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif · Feb 18, 2026 · Citations: 0

    Automatic Metrics · General

    We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces.
