
HFEPX Metric Hub

Inference Cost In CS.AI Papers

Updated from the current HFEPX corpus (Apr 9, 2026). 14 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, LLM-as-Judge. Common annotation unit: multi-dimensional rubric. Most-cited benchmark: YC-Bench. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Apr 6, 2026.

Papers: 14 · Last published: Apr 6, 2026

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: Medium.

Metric Coverage

35.7%

5 of the 14 sampled papers include explicit metric names.

Benchmark Anchoring

7.1%

1 of 14 papers includes an explicit dataset/benchmark anchor for fair comparison.

Quality Controls

0.0%

0 papers report calibration/adjudication/IAA controls.

  • None of the 14 papers in this sample are flagged as low-signal.
  • Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Primary action: Treat this as a directional signal only; metric reporting is present, but benchmark anchoring is still thin.
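
As a concrete illustration of how the triage percentages above can be derived, here is a minimal sketch that computes field-level coverage over a set of paper metadata records. The field names and the dictionary layout are illustrative assumptions, not the actual HFEPX schema.

```python
# Minimal sketch: field-level coverage over paper metadata records.
# Field names ("metrics", "benchmarks", "quality_controls") and the dict
# layout are illustrative assumptions, not the actual HFEPX schema.

def coverage(papers, field):
    """Percentage of papers that list at least one value for `field`."""
    if not papers:
        return 0.0
    hits = sum(1 for paper in papers if paper.get(field))
    return 100.0 * hits / len(papers)

papers = [
    {"title": "YC-Bench", "metrics": ["cost", "inference cost"],
     "benchmarks": ["YC-Bench"], "quality_controls": []},
    {"title": "QED-Nano", "metrics": ["cost", "inference cost"],
     "benchmarks": [], "quality_controls": []},
    # ... remaining papers in the hub ...
]

for field in ("metrics", "benchmarks", "quality_controls"):
    print(f"{field}: {coverage(papers, field):.1f}% coverage")
```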

Why This Matters (Expanded)

Why This Matters For Eval Research

  • 14.3% of papers (2/14) report explicit human-feedback signals, led by critique/edit feedback.
  • Automatic metrics appear in 28.6% of papers (4/14) in this hub.
  • YC-Bench is the only named benchmark anchor in this page (1/14 papers), so benchmark-matched cross-paper comparisons are limited to that small cohort.

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Rater pools are mostly unspecified, and annotation is commonly done with multi-dimensional rubrics; use this to scope replication staffing.
  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Metric Interpretation

  • cost is reported in 100% of hub papers (14/14); compare with a secondary metric before ranking methods.
  • inference cost is reported in 100% of hub papers (14/14); compare with a secondary metric before ranking methods (a quick agreement check is sketched below).
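
Before ranking methods on a single cost signal, a quick agreement check against a secondary metric helps flag unstable orderings. A minimal sketch, assuming scipy is available; the method names and score values below are placeholders, not results from the hub papers.

```python
# Sketch: check whether a ranking by the primary metric agrees with a
# secondary metric before reporting a single ordering.
# Scores below are placeholder values, not results from the hub papers.
from scipy.stats import kendalltau

methods = ["A", "B", "C", "D"]
inference_cost = [1.2, 0.8, 2.5, 1.0]   # primary signal, lower is better
latency_ms = [140, 95, 310, 160]        # secondary signal, lower is better

tau, p_value = kendalltau(inference_cost, latency_ms)
print(f"Kendall tau between cost and latency orderings: {tau:.2f} (p={p_value:.3f})")
# A low or negative tau means the two metrics rank methods differently,
# so a single-metric ranking is likely unstable.
```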

Benchmark Context

  • YC-Bench appears in 7.1% of hub papers (1/14); use this cohort for benchmark-matched comparisons.

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side; a sketch for assembling such a matrix programmatically follows the table.

  • YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution (Apr 1, 2026)
    Metrics: Cost, Inference cost · Benchmarks: YC-Bench · Eval modes: Automatic Metrics · Quality controls: Not reported
  • QED-Nano: Teaching a Tiny Model to Prove Hard Theorems (Apr 6, 2026)
    Metrics: Cost, Inference cost · Benchmarks: Not reported · Eval modes: Automatic Metrics · Quality controls: Not reported
  • CAMEL: Confidence-Gated Reflection for Reward Modeling (Feb 24, 2026)
    Metrics: Accuracy, Cost · Benchmarks: Not reported · Eval modes: Automatic Metrics · Quality controls: Not reported
  • Luna-2: Scalable Single-Token Evaluation with Small Language Models (Feb 20, 2026)
    Metrics: Accuracy, Latency · Benchmarks: Not reported · Eval modes: LLM-as-Judge, Automatic Metrics · Quality controls: Not reported
  • "Don't Do That!": Guiding Embodied Systems through Large Language Model-based Constraint Generation (Jun 4, 2025)
    Metrics: Cost · Benchmarks: Not reported · Eval modes: Simulation Env · Quality controls: Not reported
  • GAIN: Multiplicative Modulation for Domain Adaptation (Apr 6, 2026)
    Metrics: Not reported · Benchmarks: Not reported · Eval modes: Not reported · Quality controls: Not reported
  • VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions (Mar 24, 2026)
    Metrics: Not reported · Benchmarks: Not reported · Eval modes: Not reported · Quality controls: Not reported
  • ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models (Mar 22, 2026)
    Metrics: Not reported · Benchmarks: Not reported · Eval modes: Not reported · Quality controls: Not reported
  • Post-Training Local LLM Agents for Linux Privilege Escalation with Verifiable Rewards (Mar 18, 2026)
    Metrics: Not reported · Benchmarks: Not reported · Eval modes: Not reported · Quality controls: Not reported
  • Auto-Unrolled Proximal Gradient Descent: An AutoML Approach to Interpretable Waveform Optimization (Mar 18, 2026)
    Metrics: Not reported · Benchmarks: Not reported · Eval modes: Not reported · Quality controls: Not reported
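
A minimal sketch of how a matrix like the one above could be assembled and filtered to a comparable cohort, assuming pandas is available; the example rows mirror a few entries from the matrix, with "Not reported" fields left as None.

```python
# Sketch: assembling a protocol matrix and filtering to a comparable cohort.
# Rows mirror a few entries from the matrix above; "Not reported" fields
# are represented as None.
import pandas as pd

rows = [
    {"paper": "YC-Bench", "metrics": "Cost, Inference cost",
     "benchmark": "YC-Bench", "eval_mode": "Automatic Metrics", "qc": None},
    {"paper": "QED-Nano", "metrics": "Cost, Inference cost",
     "benchmark": None, "eval_mode": "Automatic Metrics", "qc": None},
    {"paper": "CAMEL", "metrics": "Accuracy, Cost",
     "benchmark": None, "eval_mode": "Automatic Metrics", "qc": None},
]
matrix = pd.DataFrame(rows)

# Only compare papers that share an eval mode and name at least one metric,
# to avoid comparing metrics across incompatible eval setups.
comparable = matrix[(matrix["eval_mode"] == "Automatic Metrics")
                    & matrix["metrics"].notna()]
print(comparable[["paper", "metrics", "benchmark"]])
```
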
Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (14.3% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (0% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (7.1% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (100% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (0% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (14.3% vs 35% target).
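
The checklist above compares observed coverage against fixed targets. A minimal sketch of that comparison, using the percentages and targets reported on this page:

```python
# Sketch: flagging replication risks by comparing observed coverage against
# target thresholds, as in the checklist above. Targets and observed values
# mirror the figures reported on this page.
targets = {
    "human_feedback": 45.0,
    "quality_controls": 30.0,
    "benchmarks": 35.0,
    "metrics": 35.0,
    "rater_population": 35.0,
    "annotation_unit": 35.0,
}
observed = {
    "human_feedback": 14.3,
    "quality_controls": 0.0,
    "benchmarks": 7.1,
    "metrics": 100.0,
    "rater_population": 0.0,
    "annotation_unit": 14.3,
}

for field, target in targets.items():
    status = "Strong" if observed[field] >= target else "Gap"
    print(f"{status}: {field} coverage {observed[field]:.1f}% vs {target:.0f}% target")
```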

Strengths

  • This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (0% coverage).
  • Annotation unit is under-specified (14.3% coverage).

Suggested Next Analyses

  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
  • Track metric sensitivity by reporting both cost and inference cost.

Known Limitations
  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (0% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Top Metrics

  • Cost (14)
  • Inference cost (14)
  • Accuracy (4)
  • Latency (3)

Evaluation Modes

  • Automatic Metrics (4)
  • LLM-as-Judge (1)
  • Simulation Env (1)

Top Benchmarks

  • YC-Bench (1)

Agentic Mix

  • Long Horizon (1)
  • Web Browsing (1)
