Metric Hub

Latency In CS.LG Papers

Updated from current HFEPX corpus (Feb 27, 2026). 16 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Frequent quality control: Calibration. Frequently cited benchmark: GSM8K. Common metric signal: latency. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 16 · Last published: Feb 26, 2026

Research Narrative

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 16 papers on latency in cs.LG. The dominant protocol signals are automatic metrics and simulation environments; the most frequent benchmarks are GSM8K and DROP, and the most frequent metrics are latency and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

6.3% of papers (1/16) report explicit human-feedback signals, led by pairwise preferences.

Evidence: Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators; SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents

Automatic metrics appear in 93.8% of papers (15/16) in this hub.

Evidence: InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators; SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents; Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

GSM8K is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators; SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents

Protocol Takeaways

The most common quality-control signal is rater calibration (12.5% of papers, 2/16).

Evidence: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Discrete Stochastic Localization for Non-autoregressive Generation; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators

Where rater context is reported, it is mostly domain experts, with mixed annotation units. Rater population is known for only 6.3% of papers and annotation unit for 0%, so treat this as a weak signal when scoping replication staffing.

Evidence: SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators; Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Stratify by benchmark (GSM8K vs DROP) before comparing methods.

Evidence: InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators; SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents; Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Benchmark Interpretation

GSM8K appears in 12.5% of hub papers (2/16); use this cohort for benchmark-matched comparisons.
DROP appears in 6.3% of hub papers (1/16); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Latency is reported in 100% of hub papers (16/16); compare with a secondary metric before ranking methods.
Accuracy is reported in 43.8% of hub papers (7/16); compare with a secondary metric before ranking methods.
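
The coverage percentages quoted above are simple shares of the 16 hub papers. A minimal sketch of that arithmetic follows, assuming the page rounds half-up to one decimal place (an assumption inferred from 1/16 = 6.25% being shown as 6.3%). The names `coverage_pct`, `MENTION_COUNTS`, and `TOTAL_PAPERS` are illustrative and not part of any HFEPX tooling.

```python
from decimal import Decimal, ROUND_HALF_UP

TOTAL_PAPERS = 16  # hub size stated on this page

# Mention counts as reported on this page.
MENTION_COUNTS = {
    "GSM8K": 2,
    "DROP": 1,
    "latency": 16,
    "accuracy": 7,
}


def coverage_pct(count: int, total: int = TOTAL_PAPERS) -> str:
    """Return count/total as a percentage string, rounded half-up to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    pct = (Decimal(count) * 100 / Decimal(total)).quantize(
        Decimal("0.1"), rounding=ROUND_HALF_UP  # assumed rounding mode
    )
    text = format(pct, "f")
    # Print whole numbers as "100%" rather than "100.0%".
    if text.endswith(".0"):
        text = text[:-2]
    return f"{text}%"


for name, count in MENTION_COUNTS.items():
    print(f"{name}: {coverage_pct(count)} ({count}/{TOTAL_PAPERS})")
```

The sketch uses `Decimal` because Python's built-in `round` rounds halves to even, which would turn 6.25 into 6.2 instead of the 6.3 shown on this page.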

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (6.3% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (12.5% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (56.3% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (6.3% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (0% vs 35% target).
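
The checklist above compares each coverage figure with a target. The sketch below reproduces those labels under one assumption: an item is "strong" when coverage meets or exceeds its target and a "replication risk" otherwise. The page does not state its exact rule, so this threshold logic is a guess that happens to match all six rows. `ChecklistItem`, `status`, and `ranked_gaps` are hypothetical names.

```python
from typing import List, NamedTuple


class ChecklistItem(NamedTuple):
    label: str
    coverage_pct: float
    target_pct: float


# Coverage and target values copied from the checklist on this page.
ITEMS = [
    ChecklistItem("explicit human feedback", 6.3, 45.0),
    ChecklistItem("quality controls", 12.5, 30.0),
    ChecklistItem("benchmarks/datasets named", 56.3, 35.0),
    ChecklistItem("evaluation metrics named", 100.0, 35.0),
    ChecklistItem("known rater population", 6.3, 35.0),
    ChecklistItem("known annotation unit", 0.0, 35.0),
]


def status(item: ChecklistItem) -> str:
    """Assumed rule: meeting the target counts as strong, anything below is a risk."""
    return "strong" if item.coverage_pct >= item.target_pct else "replication risk"


def ranked_gaps(items: List[ChecklistItem]) -> List[ChecklistItem]:
    """Return the at-risk items, largest shortfall (target minus coverage) first."""
    risky = [item for item in items if status(item) == "replication risk"]
    return sorted(risky, key=lambda i: i.target_pct - i.coverage_pct, reverse=True)


for item in ranked_gaps(ITEMS):
    shortfall = item.target_pct - item.coverage_pct
    print(f"{item.label}: {shortfall:.1f} points below target")
```

Sorting by shortfall is one way to decide which gap to close first; it is a design choice for this sketch, not an ordering the page prescribes.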

Known Limitations

Only 12.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (6.3% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: GSM8K - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: latency - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Automatic Metrics vs Simulation Env

both = 0, Automatic Metrics only (left_only) = 15, Simulation Env only (right_only) = 1

No paper uses both Automatic Metrics and Simulation Env.
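
The overlap counts above can be recomputed from the evaluation-mode tags in the paper list on this page. The sketch below is a minimal illustration using plain set operations; the short paper labels are abbreviations of the listed titles, and `overlap_counts` is a hypothetical helper, not part of the hub's tooling.

```python
from typing import Dict, Set

# Short labels for the 16 listed papers, grouped by the evaluation-mode tag
# shown next to each paper on this page.
AUTOMATIC_METRICS: Set[str] = {
    "InnerQ", "Vectorizing the Trie", "SWE-Protégé", "Why Pass@k",
    "Pyramid MoA", "Luna-2", "Discrete Stochastic Localization",
    "Zooming without Zooming", "AceGRPO", "CDLM", "Intelligence per Watt",
    "Inference-Cost-Aware Dynamic Tree", "SNAP-UQ", "SPECS", "vCache",
}
SIMULATION_ENV: Set[str] = {"Fast-ThinkAct"}


def overlap_counts(left: Set[str], right: Set[str]) -> Dict[str, int]:
    """Count items in both sets, only in the left set, and only in the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


overlap = overlap_counts(AUTOMATIC_METRICS, SIMULATION_ENV)
print(overlap)
```

Because the two sets are disjoint, comparisons across the two evaluation modes on this page are always between different papers, never within one paper.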

Benchmark Briefs

GSM8K

Coverage: 2 papers (12.5%) mention GSM8K.

Examples: InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

DROP

Coverage: 1 paper (6.3%) mentions DROP.

Examples: SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML

MATH

Coverage: 1 paper (6.3%) mentions MATH.

Examples: CDLM: Consistency Diffusion Language Models For Faster Sampling

Metric Briefs

latency

Coverage: 16 papers (100%) mention latency.

Examples: InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators; SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents

accuracy

Coverage: 7 papers (43.8%) mention accuracy.

Examples: InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Luna-2: Scalable Single-Token Evaluation with Small Language Models

cost

Coverage: 6 papers (37.5%) mention cost.

Examples: SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents; Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models , Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators , SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · Feb 26, 2026

Automatic Metrics · Math · Coding

Our evaluation experiments on Llama models shows that InnerQ maintains a few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.

Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt · Feb 26, 2026

Automatic Metrics · Coding

In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval.

SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026

Automatic Metrics · Coding

Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.

Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi · Feb 24, 2026

Automatic Metrics · Math · Coding

Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning.

Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026

Automatic Metrics · Math

In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.

Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026

Automatic Metrics · Coding

Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation.

Discrete Stochastic Localization for Non-autoregressive Generation
Yunshu Wu, Jiayi Cheng, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis · Feb 18, 2026

Automatic Metrics · General

On OpenWebText, DSL fine-tuning yields large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline with ~4× fewer denoiser evaluations, and matches autoregressive quality at high budgets.

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai · Feb 12, 2026

Automatic Metrics · Coding

To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.

AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen · Feb 8, 2026

Automatic Metrics · Coding

Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons.

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026

Simulation Env · General

Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie

CDLM: Consistency Diffusion Language Models For Faster Sampling
Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun · Nov 24, 2025

Automatic Metrics · Math · Coding

The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya · Nov 11, 2025

Automatic Metrics · General

Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure.

Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Yinrong Hong, Zhiquan Tan, Kai Hu · Oct 30, 2025

Automatic Metrics · Coding

Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size.

SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML
Ismail Lamaakal, Chaymae Yahyati, Khalid El Makkaoui, Ibrahim Ouahbi, Yassine Maleh · Aug 18, 2025

Automatic Metrics · Coding

Reliable uncertainty estimation is a key missing piece for on-device monitoring in TinyML: microcontrollers must detect failures, distribution shift, or accuracy drops under strict flash/latency budgets, yet common uncertainty approaches (d

SPECS: Faster Test-Time Scaling through Speculative Drafts
Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer · Jun 15, 2025

Automatic Metrics · Math

Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration.

vCache: Verified Semantic Prompt Caching
Luis Gaspar Schroeder, Aditya Desai, Alejandro Cuadron, Kyle Chu, Shu Liu · Feb 6, 2025

Automatic Metrics · General

We release the vCache implementation and four benchmarks to support future research.
