Metric Hub

Latency + Coding Metric Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 20 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: MATH. Common metric signal: latency. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 20 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 20 papers under Latency + Coding Metric Papers. The dominant protocol signals are automatic metrics and simulation environments; the most frequent benchmarks are MATH and AIME, and the most frequent metrics are latency and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

15% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders; Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters; Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Automatic metrics appear in 90% of papers in this hub.

Evidence: Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators

MATH is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters; CDLM: Consistency Diffusion Language Models For Faster Sampling; Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Protocol Takeaways

The most common quality-control signal is rater calibration (5% of papers).

Evidence: Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators

Raters are mostly domain experts, and annotation is commonly trajectory-level; use this to scope replication staffing.

Evidence: SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents; Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters; Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Stratify by benchmark (MATH vs AIME) before comparing methods.

Evidence: Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators

Benchmark Interpretation

MATH appears in 10% of hub papers (2/20); use this cohort for benchmark-matched comparisons.
AIME appears in 5% of hub papers (1/20); use this cohort for benchmark-matched comparisons.
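
A benchmark-matched comparison can be sketched as below. The method names and scores are hypothetical placeholders, not values taken from the hub papers; the sketch only illustrates that methods are ranked inside a benchmark stratum and never across MATH and AIME.

```python
from collections import defaultdict

# Hypothetical records: (method, benchmark, accuracy).
# None of these numbers come from the papers in this hub.
results = [
    ("method_a", "MATH", 0.62),
    ("method_b", "MATH", 0.58),
    ("method_a", "AIME", 0.21),
    ("method_c", "AIME", 0.27),
]


def stratify_by_benchmark(records):
    """Group (method, score) pairs under their benchmark name."""
    strata = defaultdict(list)
    for method, benchmark, score in records:
        strata[benchmark].append((method, score))
    return dict(strata)


def rank_within_benchmark(records):
    """Rank methods by score separately inside each benchmark stratum."""
    return {
        benchmark: sorted(pairs, key=lambda pair: pair[1], reverse=True)
        for benchmark, pairs in stratify_by_benchmark(records).items()
    }


rankings = rank_within_benchmark(results)
for benchmark, ranked in rankings.items():
    print(benchmark, ranked)
```

With cohorts as small as those on this page (2 papers for MATH, 1 for AIME), a stratum may hold a single method, in which case there is nothing to rank.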

Metric Interpretation

latency is reported in 100% of hub papers (20/20); compare with a secondary metric before ranking methods.
accuracy is reported in 35% of hub papers (7/20); compare with a secondary metric before ranking methods.
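
One way to compare latency with a secondary metric before ranking is a Pareto filter over (latency, accuracy) pairs. This is a minimal sketch under assumptions: the `Result` type, the method names, and all numbers are hypothetical, and accuracy is only one possible secondary metric.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    latency_ms: float  # lower is better
    accuracy: float    # higher is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.accuracy >= b.accuracy
    strictly_better = a.latency_ms < b.latency_ms or a.accuracy > b.accuracy
    return no_worse and strictly_better


def pareto_front(results: list[Result]) -> list[Result]:
    """Keep only results that no other result dominates."""
    return [
        r for r in results
        if not any(dominates(other, r) for other in results)
    ]


# Hypothetical numbers for illustration only.
candidates = [
    Result("fast_but_weak", latency_ms=40.0, accuracy=0.55),
    Result("balanced", latency_ms=70.0, accuracy=0.71),
    Result("slow_and_weak", latency_ms=90.0, accuracy=0.60),
    Result("slow_but_strong", latency_ms=120.0, accuracy=0.78),
]

front = pareto_front(candidates)
print([r.name for r in front])
```

Here `slow_and_weak` drops out because `balanced` is both faster and more accurate; the remaining three are trade-offs that latency alone cannot order.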

Researcher Checklist

Close the gap on papers with explicit human feedback: coverage is a replication risk (15% vs 45% target).
Close the gap on papers reporting quality controls: coverage is a replication risk (5% vs 30% target).
Maintain strength on papers naming benchmarks/datasets: coverage is strong (50% vs 35% target).
Maintain strength on papers naming evaluation metrics: coverage is strong (100% vs 35% target).
Close the gap on papers with known rater population: coverage is a replication risk (10% vs 35% target).
Close the gap on papers with known annotation unit: coverage is a replication risk (5% vs 35% target).
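
The checklist labels can be reproduced with a short sketch. Two things here are assumptions rather than documented behavior of the page: the paper counts are back-computed from the reported percentages of the 20-paper hub, and the rule "coverage at or above target is strong, otherwise a replication risk" is inferred from the six rows above.

```python
HUB_SIZE = 20  # papers in this hub, per the page header

# (label, papers_with_field, target_percent). Counts are back-computed
# from the percentages on this page (for example, 15% of 20 = 3).
checklist = [
    ("explicit human feedback", 3, 45),
    ("quality controls", 1, 30),
    ("benchmarks/datasets named", 10, 35),
    ("evaluation metrics named", 20, 35),
    ("known rater population", 2, 35),
    ("known annotation unit", 1, 35),
]


def coverage_percent(count: int, total: int) -> float:
    """Share of papers that report the field, as a percentage."""
    return 100.0 * count / total


def classify(coverage: float, target: float) -> str:
    # Assumed rule: meeting or exceeding the target counts as "strong".
    return "strong" if coverage >= target else "replication risk"


report = {}
for label, count, target in checklist:
    coverage = coverage_percent(count, HUB_SIZE)
    report[label] = (coverage, classify(coverage, target))

for label, (coverage, status) in report.items():
    print(f"{label}: {coverage:.0f}% -> {status}")
```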

Known Limitations

Only 5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: MATH - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: latency - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Automatic Metrics vs Simulation Env

Both: 0 · Automatic Metrics only: 18 · Simulation Env only: 2

No paper uses both Automatic Metrics and Simulation Env.
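
The overlap counts above are plain set arithmetic. In this sketch the paper IDs are hypothetical; only the set sizes (18 and 2) mirror the page, with Automatic Metrics as the left set and Simulation Env as the right set.

```python
# Hypothetical paper IDs; only the set sizes mirror the page.
automatic_metrics = {f"paper_{i:02d}" for i in range(1, 19)}  # 18 papers
simulation_env = {"paper_19", "paper_20"}                     # 2 papers


def overlap_counts(left: set, right: set) -> dict:
    """Count papers in both sets, only the left set, and only the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


counts = overlap_counts(automatic_metrics, simulation_env)
print(counts)  # {'both': 0, 'left_only': 18, 'right_only': 2}
```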

Benchmark Brief

MATH

Coverage: 2 papers (10%)

Examples: Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters; CDLM: Consistency Diffusion Language Models For Faster Sampling

Benchmark Brief

AIME

Coverage: 1 paper (5%)

Examples: ATTS: Asynchronous Test-Time Scaling via Conformal Prediction

Benchmark Brief

BrowseComp

Coverage: 1 paper (5%)

Examples: Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Metric Brief

latency

Coverage: 20 papers (100%)

Examples: Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

Metric Brief

accuracy

Coverage: 7 papers (35%)

Examples: InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; Luna-2: Scalable Single-Token Evaluation with Small Language Models

Metric Brief

cost

Coverage: 7 papers (35%)

Examples: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents; Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu · Feb 26, 2026

Automatic Metrics · Math, Medicine

Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases.

InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · Feb 26, 2026

Automatic Metrics · Math, Coding

Our evaluation experiments on Llama models show that InnerQ maintains a few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.

Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026

Automatic Metrics · Math, Coding

This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.

Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt · Feb 26, 2026

Automatic Metrics · Coding

In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval.

SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026

Automatic Metrics · Coding

Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.

Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi · Feb 24, 2026

Automatic Metrics · Math, Coding

Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning.

HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders
Kun Yuan, Junyu Bi, Daixuan Cheng, Changfa Wu, Shuwen Xiao · Feb 24, 2026

Automatic Metrics · Coding

Modern recommender systems leverage ultra-long user behavior sequences to capture dynamic preferences, but end-to-end modeling is infeasible in production due to latency and memory constraints.

CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference
Chao Fei, Guozhong Li, Chenxi Liu, Panos Kalnis · Feb 24, 2026

Automatic Metrics · Coding

Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only 1% of the KV cache, delivers low-latency stable inference with up to 4.56× higher throughput, and consistently outperforms other str

Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026

Automatic Metrics · Coding

Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation.

Query as Anchor: Scenario-Adaptive User Representation via Large Language Model
Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Ziyi Gao · Feb 16, 2026

Automatic Metrics · Coding

Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment.

From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design
Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen · Feb 14, 2026

Simulation Env · Coding

We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design.

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai · Feb 12, 2026

Automatic Metrics · Coding

To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026

Simulation Env · Math, Coding

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.

AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen · Feb 8, 2026

Automatic Metrics · Coding

Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons.

Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts
Wenhao Li, Daohai Yu, Gen Luo, Yuxin Zhang, Fei Chao · Feb 2, 2026

Automatic Metrics · Coding

Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time.

CDLM: Consistency Diffusion Language Models For Faster Sampling
Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun · Nov 24, 2025

Automatic Metrics · Math, Coding

The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.

OckBench: Measuring the Efficiency of LLM Reasoning
Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu · Nov 7, 2025

Automatic Metrics · Coding

Yet current benchmarks emphasize accuracy and output quality, neglecting a critical dimension: efficiency of token usage.

Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Yinrong Hong, Zhiquan Tan, Kai Hu · Oct 30, 2025

Automatic Metrics · Coding

Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size.

ATTS: Asynchronous Test-Time Scaling via Conformal Prediction
Jing Xiong, Qiujiang Chen, Fanghua Ye, Zhongwei Wan, Chuanyang Zheng · Sep 18, 2025

Automatic Metrics · Math, Coding

Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency.

SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML
Ismail Lamaakal, Chaymae Yahyati, Khalid El Makkaoui, Ibrahim Ouahbi, Yassine Maleh · Aug 18, 2025

Automatic Metrics · Coding

Reliable uncertainty estimation is a key missing piece for on-device monitoring in TinyML: microcontrollers must detect failures, distribution shift, or accuracy drops under strict flash/latency budgets, yet common uncertainty approaches (d
