- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026
Automatic Metrics Math Coding
This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
- Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper
Hoan My Tran, Xin Wang, Wanying Ge, Xuechen Liu, Junichi Yamagishi · Feb 26, 2026
Automatic Metrics Coding
Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized by speech generative models.
- How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026
Automatic Metrics Coding
Latent reasoning has recently been proposed as a reasoning paradigm that performs multi-step reasoning by generating steps in the latent space instead of the textual space.
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026
Automatic Metrics Coding
Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
- Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi · Feb 24, 2026
Automatic Metrics Math Coding
Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning.
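For context on the metric this entry discusses: pass@k is usually computed with the standard unbiased combinatorial estimator (given n generations per problem, c of which are correct). A minimal sketch of that estimator, not specific to this paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn without replacement from n generations (c correct)
    is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # fewer than k incorrect samples exist, so some draw must succeed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that optimizing this quantity for k > 1 rewards diverse sampling rather than a single best answer, which is the tension between pass@k and pass@1 the paper examines.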
- SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026
Automatic Metrics Coding
Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
- HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders
Kun Yuan, Junyu Bi, Daixuan Cheng, Changfa Wu, Shuwen Xiao · Feb 24, 2026
Automatic Metrics Coding
Modern recommender systems leverage ultra-long user behavior sequences to capture dynamic preferences, but end-to-end modeling is infeasible in production due to latency and memory constraints.
- Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics
Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li · Feb 23, 2026
Automatic Metrics Medicine Coding
The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings.
- Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding
Roberto Tacconelli · Feb 23, 2026
Automatic Metrics Coding
An out-of-distribution (OOD) evaluation on a document published after the model's training cutoff confirms these gains are not memorization artifacts, achieving 0.723 bpb on unseen text.
- Can Large Language Models Replace Human Coders? Introducing ContentBench
Michael Haman · Feb 23, 2026
Automatic Metrics Coding
This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks.
- Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026
Automatic Metrics Math Coding
Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.
- Watermarking LLM Agent Trajectories
Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li · Feb 21, 2026
Automatic Metrics Math Coding
LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering.
- Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026
Automatic Metrics Coding
Real-time guardrails require evaluation that is accurate, cheap, and fast, yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation.
- Sink-Aware Pruning for Diffusion Language Models
Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen · Feb 19, 2026
Automatic Metrics Coding
Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning.
- Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Wenxuan Ding, Nicholas Tomlin, Greg Durrett · Feb 18, 2026
Simulation Env Coding
Each problem has a latent environment state that can be reasoned about via a prior, which is passed to the LLM agent.
- Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification
Taja Kuzman Pungeršek, Peter Rupnik, Daniela Širinić, Nikola Ljubešić · Feb 18, 2026
Human Eval Coding Multilingual
Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data.
- MAEB: Massive Audio Embedding Benchmark
Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha · Feb 17, 2026
Simulation Env Coding Multilingual
We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages.
- Extracting Consumer Insight from Text: A Large Language Model Approach to Emotion and Evaluation Measurement
Stephan Ludwig, Peter J. Danaher, Xiaohao Yang, Yu-Ting Lin, Ehsan Abedin · Feb 17, 2026
Automatic Metrics Coding
Accurately measuring consumer emotions and evaluations from unstructured text remains a core challenge for marketing research and practice.
- Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer's Disease Detection via Speech
Xiao Wei, Bin Wen, Yuqin Lin, Kai Li, Mingyang Gu · Feb 16, 2026
Automatic Metrics Medicine Coding
Early diagnosis of Alzheimer's Disease (AD) is crucial for delaying its progression.
- Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026
Automatic Metrics Coding
On CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2%.
- Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026
Simulation Env Math Coding
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
- Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026
Simulation Env Coding
While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.
- QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
Maximilian Kreutner, Jens Rupprecht, Georg Ahnert, Ahmed Salem, Markus Strohmaier · Dec 9, 2025
Automatic Metrics Coding
QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods.
- Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li · Dec 3, 2025
Automatic Metrics Coding
Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths.
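To see why the KV cache becomes prohibitive at long sequence lengths, a back-of-envelope sketch using generic transformer arithmetic (illustrative parameter values, not this paper's model):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Memory for a decoder's KV cache: one key and one value vector
    (hence the factor of 2) per layer, per KV head, per position,
    at fp16/bf16 precision (2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# A hypothetical 7B-class model (32 layers, 32 KV heads, head_dim 128)
# at a 128K context holds a 64 GiB cache per sequence:
print(kv_cache_bytes(32, 32, 128, 131072) / 2**30)  # -> 64.0
```

This quadratic-free but still linear-in-length growth is what motivates cross-layer sharing schemes like the one the paper proposes.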
- Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Yinrong Hong, Zhiquan Tan, Kai Hu · Oct 30, 2025
Automatic Metrics Coding
Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size.
- PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space
Boyi Zeng, He Li, Shixiang Song, Yixuan Wang, Ziwei He · Sep 27, 2025
Automatic Metrics Coding
The remarkable success of Chain-of-Thought (CoT), which enhances performance by scaling generation steps at test-time, inspires us to ask: can we leverage a similar scaling of computational steps during pretraining to improve generation?
- Diffusion Language Models Know the Answer Before Decoding
Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan · Aug 27, 2025
Automatic Metrics Math Coding
Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality.
- Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang · Jun 5, 2025
Automatic Metrics Coding
Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities.
- "Don't Do That!": Guiding Embodied Systems through Large Language Model-based Constraint Generation
Amin Seffo, Aladin Djuhera, Masataro Asai, Holger Boche · Jun 4, 2025
Simulation Env Math Coding
Recent advancements in large language models (LLMs) have spurred interest in robotic navigation that incorporates complex spatial, mathematical, and conditional constraints from natural language into the planning problem.
- Renaissance: Investigating the Pretraining of Vision-Language Encoders
Clayton Fields, Casey Kennington · Nov 11, 2024
Automatic Metrics Coding
To conduct these experiments, we introduce a VL evaluation framework called Renaissance.
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation
Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang · Nov 7, 2024
Automatic Metrics Coding
The LLM-enhanced CLIP delivers consistent improvements across a wide range of downstream tasks, including linear-probe classification, zero-shot image-text retrieval with both short and long captions (in English and other languages), and other zero-shot settings.
- Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning
Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh · Feb 24, 2024
Automatic Metrics Coding
While fine-tuning large language models (LLMs) for specific tasks often yields impressive results, it comes at the cost of memory inefficiency due to back-propagation in gradient-based training.
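The memory-efficient alternative this line of work builds on is zeroth-order optimization: estimating a gradient from two forward passes with a shared random perturbation, so no backpropagation state is kept. A minimal SPSA-style sketch (generic estimator, not the paper's sparse variant):

```python
import numpy as np

def spsa_grad(loss_fn, theta: np.ndarray, eps: float = 1e-3,
              seed: int = 0) -> np.ndarray:
    """Zeroth-order gradient estimate from two forward passes.

    Samples a Gaussian direction z, evaluates the loss at theta +/- eps*z,
    and scales z by the resulting finite difference. Only the RNG seed
    needs to be stored to regenerate z, which is the memory trick MeZO
    exploits at LLM scale.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)
    g_scalar = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
    return g_scalar * z

# Sanity check on a quadratic loss f(x) = x.x (true gradient 2x):
theta = np.ones(4)
g = spsa_grad(lambda x: float(x @ x), theta)
```

The estimate is the true gradient projected onto a random direction, so it is unbiased in expectation but noisy; sparsifying which parameters are perturbed (the paper's focus) is one way to reduce that noise.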