HFEPX Hub

Automatic Metrics + Coding + Math Papers

Updated from current HFEPX corpus (Feb 27, 2026). 34 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: GSM8K. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 34 Last published: Feb 26, 2026 Global RSS Tag RSS

Automatic MetricsCodingMath

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 34 papers for Automatic Metrics + Coding + Math Papers. Dominant protocol signals include automatic metrics, simulation environments, with frequent benchmark focus on GSM8K, MATH and metric focus on accuracy, latency. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

5.9% of papers report explicit human-feedback signals, led by expert verification.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models , A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
automatic metrics appears in 100% of papers in this hub.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models , A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
GSM8K is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models , SPQ: An Ensemble Technique for Large Language Model Compression , AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Protocol Takeaways

Most common quality-control signal is rater calibration (2.9% of papers).

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models , A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models , A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Stratify by benchmark (GSM8K vs MATH) before comparing methods.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models , A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Benchmark Interpretation

GSM8K appears in 17.6% of hub papers (6/34); use this cohort for benchmark-matched comparisons.
MATH appears in 17.6% of hub papers (6/34); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 41.2% of hub papers (14/34); compare with a secondary metric before ranking methods.
latency is reported in 17.6% of hub papers (6/34); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (5.9% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (2.9% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (55.9% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (64.7% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (5.9% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (11.8% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (5.9% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (2.9% vs 30% target).

Papers naming benchmarks/datasets

Coverage is strong (55.9% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (64.7% vs 35% target).

Papers with known rater population

Coverage is a replication risk (5.9% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (11.8% vs 35% target).

Known Limitations

Only 2.9% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (5.9% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: GSM8K - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=1, left_only=33, right_only=0

1 papers use both Automatic Metrics and Simulation Env.

Benchmark Brief

GSM8K

Coverage: 6 papers (17.6%)

6 papers (17.6%) mention GSM8K.

Examples: InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models , SPQ: An Ensemble Technique for Large Language Model Compression , Scaling Beyond Masked Diffusion Language Models

Benchmark Brief

MATH

Coverage: 6 papers (17.6%)

6 papers (17.6%) mention MATH.

Examples: RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models , Recursive Concept Evolution for Compositional Reasoning in Large Language Models , LLMs Know More About Numbers than They Can Say

Benchmark Brief

MMLU

Coverage: 3 papers (8.8%)

3 papers (8.8%) mention MMLU.

Examples: ID-LoRA: Efficient Low-Rank Adaptation Inspired by Matrix Interpolative Decomposition , Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation , Diffusion Language Models Know the Answer Before Decoding

Metric Brief

accuracy

Coverage: 14 papers (41.2%)

14 papers (41.2%) mention accuracy.

Examples: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models , Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

Metric Brief

latency

Coverage: 6 papers (17.6%)

6 papers (17.6%) mention latency.

Examples: Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models , Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

Metric Brief

cost

Coverage: 5 papers (14.7%)

5 papers (14.7%) mention cost.

Examples: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching , Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training , Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026 · Citations: 0

Automatic Metrics Multi Agent

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants.
Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu · Feb 26, 2026 · Citations: 0

Automatic Metrics

Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases.
InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · Feb 26, 2026 · Citations: 0

Automatic Metrics

Our evaluation experiments on Llama models shows that InnerQ maintains a few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall · Feb 26, 2026 · Citations: 0

Automatic Metrics

Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred f
Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song · Feb 26, 2026 · Citations: 0

Automatic Metrics

Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, lea
Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi · Feb 24, 2026 · Citations: 0

Automatic Metrics

Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning.
LogicGraph : Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification
Yanrui Wu, Lingling Zhang, Xinyu Zhang, Jiayu Chang, Pengyu Li · Feb 24, 2026 · Citations: 0

Automatic Metrics

Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof.
ID-LoRA: Efficient Low-Rank Adaptation Inspired by Matrix Interpolative Decomposition
Xindian Ma, Rundong Kong, Peng Zhang, Ruoxiang Huang, Yongyu Jiang · Feb 24, 2026 · Citations: 0

Automatic Metrics

We evaluate ID-LoRA on five diverse benchmarks: Mathematical Reasoning, Code Generation, MMLU, CommonsenseQA, and Safety Alignment.
Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026 · Citations: 0

Automatic Metrics Long Horizon

Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.
Hyperbolic Busemann Neural Networks
Ziheng Chen, Bernhard Schölkopf, Nicu Sebe · Feb 21, 2026 · Citations: 0

Automatic Metrics

Hyperbolic spaces provide a natural geometry for representing hierarchical and tree-structured data due to their exponential volume growth.
Watermarking LLM Agent Trajectories
Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li · Feb 21, 2026 · Citations: 0

Automatic Metrics Long Horizon

LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering.
SPQ: An Ensemble Technique for Large Language Model Compression
Jiamin Yao, Eren Gultepe · Feb 20, 2026 · Citations: 0

Automatic MetricsSimulation Env

Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.
VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean
Yutong Xin, Qiaochu Chen, Greg Durrett, Işil Dillig · Feb 20, 2026 · Citations: 0

Automatic Metrics

However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries.
Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning
Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao jin · Feb 20, 2026 · Citations: 0

Automatic Metrics

Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead.
RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
Yunseok Han, Yejoon Lee, Jaeyoung Do · Feb 19, 2026 · Citations: 0

Automatic Metrics

To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions.
Recursive Concept Evolution for Compositional Reasoning in Large Language Models
Sarim Chaudhry · Feb 17, 2026 · Citations: 0

Automatic Metrics

Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE.
Scaling Beyond Masked Diffusion Language Models
Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu · Feb 16, 2026 · Citations: 0

Automatic Metrics

Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks.
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang · Feb 12, 2026 · Citations: 0

Expert Verification Automatic Metrics

On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distil
Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models
Mingyu Cao, Alvaro H. C. Correia, Christos Louizos, Shiwei Liu, Lu Yin · Feb 11, 2026 · Citations: 0

Automatic Metrics

Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and
LLMs Know More About Numbers than They Can Say
Fengting Yuchi, Li Du, Jason Eisner · Feb 8, 2026 · Citations: 0

Automatic Metrics

Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: "Which is larger, $5.7 \times 10^2$ or $580$?" This raises a fundamental question: Do LLMs even know how big
Accelerating Scientific Research with Gemini: Case Studies and Common Techniques
David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo · Feb 3, 2026 · Citations: 0

Automatic Metrics

Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer.
CDLM: Consistency Diffusion Language Models For Faster Sampling
Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun · Nov 24, 2025 · Citations: 0

Automatic Metrics

The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao · Oct 10, 2025 · Citations: 0

Automatic Metrics

We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings.
Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti · Sep 18, 2025 · Citations: 0

Automatic Metrics

Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges.
ATTS: Asynchronous Test-Time Scaling via Conformal Prediction
Jing Xiong, Qiujiang Chen, Fanghua Ye, Zhongwei Wan, Chuanyang Zheng · Sep 18, 2025 · Citations: 0

Automatic Metrics

Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency.
Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang · Sep 2, 2025 · Citations: 0

Automatic Metrics

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming.
NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems
Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li · Aug 28, 2025 · Citations: 0

Automatic Metrics

However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored.
Diffusion Language Models Know the Answer Before Decoding
Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan · Aug 27, 2025 · Citations: 0

Automatic Metrics

Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality.
Hidden Dynamics of Massive Activations in Transformer Training
Jorge Gallego-Feliciano, S. Aaron McClendon, Juan Morinelli, Stavros Zervoudakis, Antonios Saravanos · Aug 5, 2025 · Citations: 0

Automatic Metrics

We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed, and release our full dataset publicly to support further research.
Spurious Rewards: Rethinking Training Signals in RLVR
Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang · Jun 12, 2025 · Citations: 0

Automatic Metrics

We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer.
Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs
Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh · Jun 2, 2025 · Citations: 0

Automatic Metrics

Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation.
MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task
Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Xin Xu · Feb 17, 2025 · Citations: 0

Automatic Metrics

Through comprehensive experiments on multiple mathematical reasoning datasets, including MathInstruct, MetaMathQA and etc., we demonstrate that models trained on MathFimer-expanded data consistently outperform their counterparts trained on

Automatic Metrics + Coding + Math Papers

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Top Papers

Related Hubs