Metric Hub

Inference Cost In CS.CL Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 10 papers. Common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Scalar. Frequent quality control: Calibration. Frequently cited benchmark: BrowseComp. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 10 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 10 papers for Inference Cost In CS.CL Papers. Automatic metrics are the dominant protocol signal; the most frequent benchmarks are BrowseComp and GAIA, and the most frequent metrics are cost and inference cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

10% of papers report explicit human-feedback signals, led by critique/edit feedback.

Evidence: CAMEL: Confidence-Gated Reflection for Reward Modeling; Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Luna-2: Scalable Single-Token Evaluation with Small Language Models

Automatic metrics appear in 100% of papers in this hub.

Evidence: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization; CAMEL: Confidence-Gated Reflection for Reward Modeling; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Luna-2: Scalable Single-Token Evaluation with Small Language Models

BrowseComp is the benchmark anchor for cross-paper comparisons on this page, although only 1 of 10 papers reports it.

Evidence: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization; CAMEL: Confidence-Gated Reflection for Reward Modeling; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Luna-2: Scalable Single-Token Evaluation with Small Language Models

Protocol Takeaways

The most common quality-control signal is rater calibration (20% of papers).

Evidence: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; GRASP: Replace Redundant Layers with Adaptive Singular Parameters for Efficient Model Compression; Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization; CAMEL: Confidence-Gated Reflection for Reward Modeling

Raters are mostly domain experts, and annotation is commonly scalar scoring; use this to scope replication staffing.

Evidence: Cost-of-Pass: An Economic Framework for Evaluating Language Models; Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization; CAMEL: Confidence-Gated Reflection for Reward Modeling; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Stratify by benchmark (BrowseComp vs GAIA) before comparing methods.

Evidence: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization; CAMEL: Confidence-Gated Reflection for Reward Modeling; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Luna-2: Scalable Single-Token Evaluation with Small Language Models
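
The stratification takeaway above can be sketched in a few lines of Python. This is a minimal illustration, not part of the hub's tooling: the record layout, the method names, and every number are hypothetical placeholders rather than results from any listed paper. The point is only that methods are ranked inside one benchmark at a time and never across benchmarks.

```python
from collections import defaultdict

# Hypothetical records: (method, benchmark, accuracy, cost_usd).
# All values are placeholders for illustration, not reported results.
results = [
    ("method_a", "BrowseComp", 0.42, 1.80),
    ("method_b", "BrowseComp", 0.39, 0.90),
    ("method_a", "GAIA", 0.61, 1.10),
    ("method_b", "GAIA", 0.64, 0.70),
]


def stratify_by_benchmark(rows):
    """Group rows by benchmark so comparisons stay within one benchmark."""
    groups = defaultdict(list)
    for method, benchmark, accuracy, cost in rows:
        groups[benchmark].append((method, accuracy, cost))
    return dict(groups)


def rank_within(groups):
    """Rank methods per benchmark: higher accuracy first, lower cost breaks ties."""
    return {
        bench: [m for m, _, _ in sorted(rows, key=lambda r: (-r[1], r[2]))]
        for bench, rows in groups.items()
    }


rankings = rank_within(stratify_by_benchmark(results))
for bench, order in rankings.items():
    print(bench, order)
```

With these placeholder values the two benchmarks produce opposite orderings, which is the situation a pooled comparison would hide.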

Benchmark Interpretation

BrowseComp appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
GAIA appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

cost is reported in 100% of hub papers (10/10); compare with a secondary metric before ranking methods.
inference cost is reported in 100% of hub papers (10/10); compare with a secondary metric before ranking methods.
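
One way to pair cost with a secondary metric before ranking is to keep only the methods that are not dominated on both axes. The sketch below assumes accuracy as the secondary metric (it is the next most reported metric in this hub) and uses hypothetical method names and placeholder numbers; it is not a procedure taken from any of the listed papers.

```python
def pareto_frontier(points):
    """Return names of methods that no other method beats on both axes.

    Each point is (name, cost, accuracy). A method is dominated when another
    method has cost <= and accuracy >=, with at least one strict inequality.
    """
    frontier = []
    for name, cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for other, c, a in points
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (name, cost per query in USD, accuracy); placeholders only.
methods = [
    ("cheap_small", 0.10, 0.55),
    ("mid_router", 0.40, 0.70),
    ("big_model", 1.50, 0.72),
    ("wasteful", 1.60, 0.60),
]

frontier = pareto_frontier(methods)
print(frontier)
```

Ranking by cost alone would place "cheap_small" first and ranking by accuracy alone would place "big_model" first; the frontier keeps both and drops only the method that is worse on both axes.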

Researcher Checklist

Close the gap on papers with explicit human feedback: coverage is a replication risk (10% vs 45% target).
Tighten coverage on papers reporting quality controls: coverage is usable but incomplete (20% vs 30% target).
Close the gap on papers naming benchmarks/datasets: coverage is a replication risk (20% vs 35% target).
Maintain strength on papers naming evaluation metrics: coverage is strong (100% vs 35% target).
Close the gap on papers with known rater population: coverage is a replication risk (10% vs 35% target).
Close the gap on papers with known annotation unit: coverage is a replication risk (20% vs 35% target).
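
The checklist labels above can be reproduced with a simple coverage-to-target ratio. The page does not state the rule it uses, so the cutoffs here are assumptions: a ratio of at least 1.0 maps to "strong", at least 0.6 to "usable but incomplete", and anything lower to "replication risk". The function name and the 0.6 threshold are hypothetical and were chosen only because they agree with the six rows listed.

```python
def coverage_status(coverage_pct, target_pct, usable_ratio=0.6):
    """Label coverage against a target.

    usable_ratio is an assumed cutoff, not a documented value of this page.
    """
    ratio = coverage_pct / target_pct
    if ratio >= 1.0:
        return "strong"
    if ratio >= usable_ratio:
        return "usable but incomplete"
    return "replication risk"


# (checklist item, coverage %, target %) as listed in the checklist above.
checklist = [
    ("explicit human feedback", 10, 45),
    ("quality controls", 20, 30),
    ("benchmarks/datasets", 20, 35),
    ("evaluation metrics", 100, 35),
    ("known rater population", 10, 35),
    ("known annotation unit", 20, 35),
]

statuses = {item: coverage_status(cov, tgt) for item, cov, tgt in checklist}
for item, status in statuses.items():
    print(f"{item}: {status}")
```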

Known Limitations

Rater population is under-specified (10% coverage).
Annotation unit is under-specified (20% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: BrowseComp - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: cost - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Benchmark Brief

BrowseComp

Coverage: 1 paper (10%)

1 paper (10%) mentions BrowseComp.

Examples: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Benchmark Brief

GAIA

Coverage: 1 paper (10%)

1 paper (10%) mentions GAIA.

Examples: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Benchmark Brief

GSM8K

Coverage: 1 paper (10%)

1 paper (10%) mentions GSM8K.

Examples: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Metric Brief

cost

Coverage: 10 papers (100%)

10 papers (100%) mention cost.

Examples: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization; CAMEL: Confidence-Gated Reflection for Reward Modeling; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Metric Brief

inference cost

Coverage: 10 papers (100%)

10 papers (100%) mention inference cost.

Metric Brief

accuracy

Coverage: 5 papers (50%)

5 papers (50%) mention accuracy.

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization; CAMEL: Confidence-Gated Reflection for Reward Modeling; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu · Feb 26, 2026

Automatic Metrics · General

Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.

CAMEL: Confidence-Gated Reflection for Reward Modeling
Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar · Feb 24, 2026

Automatic Metrics · General

Reward models play a fundamental role in aligning large language models with human preferences.

Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026

Automatic Metrics · Math

In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.

Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026

Automatic Metrics · Coding

Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation.

Sink-Aware Pruning for Diffusion Language Models
Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen · Feb 19, 2026

Automatic Metrics · Coding

Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning.

TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers
Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif · Feb 18, 2026

Automatic Metrics · General

Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, gating,

Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Yinrong Hong, Zhiquan Tan, Kai Hu · Oct 30, 2025

Automatic Metrics · Coding

Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size.

PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space
Boyi Zeng, He Li, Shixiang Song, Yixuan Wang, Ziwei He · Sep 27, 2025

Automatic Metrics · Coding

The remarkable success of Chain-of-Thought (CoT), which enhances performance by scaling generation steps at test-time, inspires us to ask: can we leverage a similar scaling of computational steps during pretraining to improve the generation

Cost-of-Pass: An Economic Framework for Evaluating Language Models
Mehmet Hamza Erol, Batu El, Mirac Suzgun, Mert Yuksekgonul, James Zou · Apr 17, 2025

Automatic Metrics · General

We then define the frontier cost-of-pass: the minimum cost-of-pass achievable across available models or the human-expert(s), using the approx.

GRASP: Replace Redundant Layers with Adaptive Singular Parameters for Efficient Model Compression
Kainan Liu, Yong Zhang, Ning Cheng, Zhitao Li, Shaojun Wang · Dec 31, 2024

Automatic Metrics · General

Recent studies have demonstrated that many layers are functionally redundant in large language models (LLMs), enabling model compression by removing these layers to reduce inference cost.
