HFEPX Metric Hub

Inference Cost In CS.LG Papers

Updated from current HFEPX corpus (Apr 9, 2026). 11 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Llm As Judge. Common annotation unit: Multi Dim Rubric. Frequently cited benchmark: GSM8K. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Apr 6, 2026.

Papers: 11 | Last published: Apr 6, 2026

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: Developing.

Metric Coverage: 27.3% (3 sampled papers include metric names)

Benchmark Anchoring: 9.1% (papers with explicit dataset/benchmark anchors for fair comparison)

Quality Controls: 0.0% (0 papers report calibration/adjudication/IAA controls)

  • None of the 11 papers in this sample is flagged as low-signal.
  • Use the protocol matrix below to avoid comparing metrics across incompatible eval setups; a pairing sketch follows this list.
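
That compatibility rule can be mechanized. Below is a minimal sketch, assuming the protocol matrix has been exported as one record per paper with set-valued metrics, benchmarks, and eval modes; the export format, column names, and the `comparable` helper are assumptions, not part of the hub, and only the three metric-reporting papers from the matrix are included.

```python
# A minimal sketch, not the hub's actual export: each row mirrors one
# paper from the protocol matrix later on this page.
import pandas as pd

matrix = pd.DataFrame([
    {"paper": "S0 Tuning", "metrics": {"Pass@1", "Cost"},
     "benchmarks": {"MATH 500", "GSM8K"}, "eval_modes": {"Automatic Metrics"}},
    {"paper": "QED-Nano", "metrics": {"Cost", "Inference cost"},
     "benchmarks": set(), "eval_modes": {"Automatic Metrics"}},
    {"paper": "Luna-2", "metrics": {"Accuracy", "Latency"},
     "benchmarks": set(), "eval_modes": {"Llm As Judge", "Automatic Metrics"}},
])

def comparable(a, b):
    """Treat two papers as comparable only if they share a metric, an eval
    mode, and (when both anchor benchmarks) at least one benchmark."""
    shared_metric = bool(a["metrics"] & b["metrics"])
    shared_mode = bool(a["eval_modes"] & b["eval_modes"])
    both_anchored = bool(a["benchmarks"]) and bool(b["benchmarks"])
    shared_bench = (not both_anchored) or bool(a["benchmarks"] & b["benchmarks"])
    return shared_metric and shared_mode and shared_bench

pairs = [(a["paper"], b["paper"])
         for i, a in matrix.iterrows()
         for j, b in matrix.iterrows()
         if i < j and comparable(a, b)]
print(pairs)  # [('S0 Tuning', 'QED-Nano')]: they share Cost + Automatic Metrics
```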

Primary action: Treat this as directional signal only; metric reporting is present but benchmark anchoring is still thin.

Why This Matters For Eval Research

  • 9.1% of papers report explicit human-feedback signals, led by rubric ratings.
  • The Automatic Metrics eval mode appears in 27.3% of papers in this hub.
  • GSM8K is a recurring benchmark anchor for cross-paper comparisons in this page.

Metric-Driven Protocol Takeaways

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Rater pools are mostly unspecified and annotation commonly uses multi-dimensional rubrics; use this to scope replication staffing.

Metric Interpretation

  • cost and inference cost are each reported in 100% of hub papers (11/11); compare with a secondary metric before ranking methods, as in the sketch below.
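
Acting on that advice takes a few lines of pandas. The sketch below uses hypothetical per-method cost and accuracy values (methods A through D; none are taken from the hub papers) to test whether a secondary metric would reorder a cost-based ranking:

```python
# A minimal sketch with hypothetical numbers, not values from the hub
# papers: check whether a secondary metric agrees with the cost ranking.
import pandas as pd

results = pd.DataFrame({
    "method":   ["A", "B", "C", "D"],
    "cost":     [1.00, 1.40, 2.10, 2.30],   # lower is better
    "accuracy": [0.61, 0.72, 0.70, 0.74],   # higher is better
}).set_index("method")

by_cost = results["cost"].rank()                     # cheapest method first
by_acc = results["accuracy"].rank(ascending=False)   # most accurate first
agreement = by_cost.corr(by_acc, method="spearman")

print(results.assign(cost_rank=by_cost, acc_rank=by_acc))
print(f"rank agreement (Spearman): {agreement:.2f}")  # -0.80 here
# Low or negative agreement means a cost-only ranking would mislead.
```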

Benchmark Context

  • GSM8K and HumanEval+ each appear in 9.1% of hub papers (1/11); use these cohorts for benchmark-matched comparisons (a cohort-building sketch follows).
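
A minimal sketch of building those cohorts, assuming the hub's benchmark anchors are exported in long format with one row per (paper, benchmark) pair; the format is an assumption:

```python
# A minimal sketch, assuming a long-format export of benchmark anchors;
# papers whose benchmarks are "Not reported" contribute no rows.
import pandas as pd

anchors = pd.DataFrame([
    {"paper": "S0 Tuning", "benchmark": "GSM8K"},
    {"paper": "S0 Tuning", "benchmark": "MATH 500"},
])

# Only papers sharing a benchmark anchor belong in the same comparison.
cohorts = anchors.groupby("benchmark")["paper"].apply(list)
print(cohorts[cohorts.str.len() > 1])
# Empty here: with one anchored paper per benchmark (1/11), every cohort
# is a singleton, so no benchmark-matched comparison is possible yet.
```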

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

Paper | Date | Metrics | Benchmarks | Eval Modes | Quality Controls
S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models | Apr 1, 2026 | Pass@1, Cost | MATH 500, GSM8K | Automatic Metrics | Not reported
QED-Nano: Teaching a Tiny Model to Prove Hard Theorems | Apr 6, 2026 | Cost, Inference cost | Not reported | Automatic Metrics | Not reported
Luna-2: Scalable Single-Token Evaluation with Small Language Models | Feb 20, 2026 | Accuracy, Latency | Not reported | Llm As Judge, Automatic Metrics | Not reported
Are Latent Reasoning Models Easily Interpretable? | Apr 6, 2026 | Not reported | Not reported | Not reported | Not reported
GAIN: Multiplicative Modulation for Domain Adaptation | Apr 6, 2026 | Not reported | Not reported | Not reported | Not reported
VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions | Mar 24, 2026 | Not reported | Not reported | Not reported | Not reported
Auto-Unrolled Proximal Gradient Descent: An AutoML Approach to Interpretable Waveform Optimization | Mar 18, 2026 | Not reported | Not reported | Not reported | Not reported
Ensemble Self-Training for Unsupervised Machine Translation | Mar 17, 2026 | Not reported | Not reported | Not reported | Not reported
Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs | Mar 16, 2026 | Not reported | Not reported | Not reported | Not reported
Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability | Mar 12, 2026 | Not reported | Not reported | Not reported | Not reported
Researcher Workflow

Checklist

  • Gap: Papers with explicit human feedback. Coverage is a replication risk (9.1% vs 45% target).
  • Gap: Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
  • Gap: Papers naming benchmarks/datasets. Coverage is a replication risk (9.1% vs 35% target).
  • Strong: Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
  • Gap: Papers with known rater population. Coverage is a replication risk (0% vs 35% target).
  • Gap: Papers with known annotation unit. Coverage is a replication risk (18.2% vs 35% target).
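
The checklist arithmetic above can be reproduced from the hub counts alone. In the sketch below, the per-attribute counts are back-solved from the page's percentages (9.1% of 11 papers = 1, 18.2% = 2) and the targets are the hub's stated ones:

```python
# A minimal sketch reproducing the checklist: coverage = papers with the
# attribute / total papers, flagged against the hub's stated targets.
TOTAL = 11
checks = {
    "explicit human feedback": (1, 0.45),
    "quality controls":        (0, 0.30),
    "named benchmarks":        (1, 0.35),
    "named metrics":           (11, 0.35),
    "known rater population":  (0, 0.35),
    "known annotation unit":   (2, 0.35),
}

for name, (count, target) in checks.items():
    coverage = count / TOTAL
    status = "Strong" if coverage >= target else "Gap"
    print(f"{status:>6}: {name} ({coverage:.1%} vs {target:.0%} target)")
```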

Strengths

  • This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (0% coverage).
  • Annotation unit is under-specified (18.2% coverage).

Suggested Next Analyses

  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
  • Stratify by benchmark (GSM8K vs HumanEval+) before comparing methods.
  • Track metric sensitivity by reporting both cost and inference cost.

Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
Research Utility Snapshot

Top Metrics

  • Cost (11)
  • Inference cost (11)
  • Accuracy (3)
  • Throughput (2)

Evaluation Modes

  • Automatic Metrics (3)
  • Llm As Judge (1)

Top Benchmarks

  • GSM8K (1)
  • HumanEval+ (1)
  • MATH 500 (1)
  • Spider (1)

Agentic Mix

  • Long Horizon (1)
