HFEPX Hub

CS.AI + Math Papers

Updated from current HFEPX corpus (Feb 27, 2026). 68 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: MATH. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 68 Last published: Feb 26, 2026 Global RSS Tag RSS

Cs.AIMath

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 68 papers for CS.AI + Math Papers. Dominant protocol signals include automatic metrics, simulation environments, human evaluation, with frequent benchmark focus on MATH, GSM8K and metric focus on accuracy, cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

11.8% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences , AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
automatic metrics appears in 92.6% of papers in this hub.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring , NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion
MATH is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning , AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Protocol Takeaways

Most common quality-control signal is rater calibration (5.9% of papers).

Evidence: Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration , AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.

Evidence: Aletheia tackles FirstProof autonomously , AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring , NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion

Benchmark Interpretation

MATH appears in 19.1% of hub papers (13/68); use this cohort for benchmark-matched comparisons.
GSM8K appears in 11.8% of hub papers (8/68); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 35.3% of hub papers (24/68); compare with a secondary metric before ranking methods.
cost is reported in 14.7% of hub papers (10/68); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (11.8% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (5.9% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (45.6% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (52.9% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (7.4% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (5.9% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (11.8% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (5.9% vs 30% target).

Papers naming benchmarks/datasets

Coverage is strong (45.6% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (52.9% vs 35% target).

Papers with known rater population

Coverage is a replication risk (7.4% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (5.9% vs 35% target).

Known Limitations

Only 5.9% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7.4% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: MATH - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=0, left_only=1, right_only=63

0 papers use both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=0, left_only=63, right_only=4

0 papers use both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=0, left_only=4, right_only=1

0 papers use both Simulation Env and Human Eval.

Benchmark Brief

MATH

Coverage: 13 papers (19.1%)

13 papers (19.1%) mention MATH.

Examples: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning , Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards , RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

Benchmark Brief

GSM8K

Coverage: 8 papers (11.8%)

8 papers (11.8%) mention GSM8K.

Examples: Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration , Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference , Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models

Benchmark Brief

Retrieval

Coverage: 6 papers (8.8%)

6 papers (8.8%) mention Retrieval.

Examples: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance , KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Metric Brief

accuracy

Coverage: 24 papers (35.3%)

24 papers (35.3%) mention accuracy.

Examples: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching , Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance

Metric Brief

cost

Coverage: 10 papers (14.7%)

10 papers (14.7%) mention cost.

Examples: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching , Sparsity Induction for Accurate Post-Training Pruning of Large Language Models , Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Metric Brief

latency

Coverage: 7 papers (10.3%)

7 papers (10.3%) mention latency.

Examples: Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching , Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026 · Citations: 0

Automatic Metrics Multi Agent

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants.
Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu · Feb 26, 2026 · Citations: 0

Automatic Metrics

Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases.
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall · Feb 26, 2026 · Citations: 0

Automatic Metrics

Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred f
NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion
Hung-Hsuan Chen · Feb 26, 2026 · Citations: 0

Automatic Metrics

On the SlimOrca benchmark, NoRA breaks this linear barrier: NoRA remarkably at rank 64 (PPL 3.89) outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency.
Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song · Feb 26, 2026 · Citations: 0

Automatic Metrics

Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, lea
Sparsity Induction for Accurate Post-Training Pruning of Large Language Models
Minhao Jiang, Zhikai Li, Xuewen Liu, Jing Zhang, Mengjuan Chen · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large language models have demonstrated capabilities in text generation, while their increasing parameter scales present challenges in computational and memory efficiency.
Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang · Feb 24, 2026 · Citations: 0

Automatic Metrics

Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
Charafeddine Mouzouni · Feb 24, 2026 · Citations: 0

Automatic Metrics

We validate across five benchmarks, five models from three families, and both synthetic and real data.
Representation Theorems for Cumulative Propositional Dependence Logics
Juha Kontinen, Arne Meier, Kai Sauerwald · Feb 24, 2026 · Citations: 0

Automatic Metrics

This paper establishes and proves representation theorems for cumulative propositional dependence logic and for cumulative propositional logic with team semantics.
Equitable Evaluation via Elicitation
Elbert Du, Cynthia Dwork, Lunjia Hu, Reid McIlroy-Young, Han Shao · Feb 24, 2026 · Citations: 0

Automatic Metrics

To obtain sufficient training data, we train an LLM to act as synthetic humans.
Aletheia tackles FirstProof autonomously
Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov · Feb 24, 2026 · Citations: 0

Automatic Metrics

We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge.
Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi · Feb 24, 2026 · Citations: 0

Automatic Metrics

Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning.
LogicGraph : Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification
Yanrui Wu, Lingling Zhang, Xinyu Zhang, Jiayu Chang, Pengyu Li · Feb 24, 2026 · Citations: 0

Automatic Metrics

Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof.
Group Orthogonalized Policy Optimization:Group Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Feb 24, 2026 · Citations: 0

Automatic Metrics

Experiments on mathematical reasoning benchmarks show that GOPO achieves competitive generalization while maintaining stable gradient dynamics and entropy preservation in regimes where clipping-based methods plateau.
Pipeline for Verifying LLM-Generated Mathematical Solutions
Varvara Sazonova, Dmitri Shmelkin, Stanislav Kikot, Vasily Motolygin · Feb 24, 2026 · Citations: 0

Automatic Metrics

We introduce a pipeline for both automatic and interactive verification as a more accurate alternative to only checking the answer which is currently the most popular approach for benchmarks.
Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning
Xu Wan, Yansheng Wang, Wenqi Huang, Mingyang Sun · Feb 24, 2026 · Citations: 0

Automatic Metrics

Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinders learning efficiency on difficult samples during large language models post-tr
KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi · Feb 23, 2026 · Citations: 0

Automatic Metrics

Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-sp
Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026 · Citations: 0

Automatic Metrics

In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.
Hyperbolic Busemann Neural Networks
Ziheng Chen, Bernhard Schölkopf, Nicu Sebe · Feb 21, 2026 · Citations: 0

Automatic Metrics

Hyperbolic spaces provide a natural geometry for representing hierarchical and tree-structured data due to their exponential volume growth.
Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026 · Citations: 0

Pairwise Preference Human Eval

We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight
Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning
Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao jin · Feb 20, 2026 · Citations: 0

Automatic Metrics

Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead.
Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026 · Citations: 0

Automatic Metrics

Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
ArXiv-to-Model: A Practical Study of Scientific LM Training
Anuj Gupta · Feb 19, 2026 · Citations: 0

Automatic Metrics

While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented.
RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
Yunseok Han, Yejoon Lee, Jaeyoung Do · Feb 19, 2026 · Citations: 0

Automatic Metrics

To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions.
From Growing to Looping: A Unified View of Iterative Computation in LLMs
Ferdinand Kapl, Emmanouil Angelis, Kaitlin Maile, Johannes von Oswald, Stefan Bauer · Feb 18, 2026 · Citations: 0

Automatic Metrics

Looping, reusing a block of layers across depth, and depth growing, training shallow-to-deep models by duplicating middle layers, have both been linked to stronger reasoning, but their relationship remains unclear.
Learning to Learn from Language Feedback with Social Meta-Learning
Jonathan Cook, Diego Antognini, Martin Klissarov, Claudiu Musat, Edward Grefenstette · Feb 18, 2026 · Citations: 0

Automatic Metrics

They are rarely proactive in soliciting this feedback, even when faced with ambiguity, which can make their dialogues feel static, one-sided, and lacking the adaptive qualities of human conversation.
Recursive Concept Evolution for Compositional Reasoning in Large Language Models
Sarim Chaudhry · Feb 17, 2026 · Citations: 0

Automatic Metrics

Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE.
STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng · Feb 17, 2026 · Citations: 0

Automatic Metrics

Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% ($ρ_{\mathrm{T}}$=1.0, top-p=1.0) and
RUVA: Personalized Transparent On-Device Graph Reasoning
Gabriele Conte, Alessio Mattiace, Gianni Carmosino, Potito Aghilar, Giovanni Servedio · Feb 17, 2026 · Citations: 0

Automatic Metrics

We propose Ruva, the first "Glass Box" architecture designed for Human-in-the-Loop Memory Curation.
Prescriptive Scaling Reveals the Evolution of Language Model Capabilities
Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade · Feb 17, 2026 · Citations: 0

Automatic Metrics

Using large scale observational evaluations with 5k observational and 2k newly sampled data on model performance, we estimate capability boundaries, high conditional quantiles of benchmark scores as a function of log pre training FLOPs, via
Weight space Detection of Backdoors in LoRA Adapters
David Puertolas Merenciano, Ekaterina Vasyagina, Raghav Dixit, Kevin Zhu, Ruizhe Li · Feb 16, 2026 · Citations: 0

Automatic Metrics

We evaluate the method on 500 LoRA adapters -- 400 clean, and 100 poisoned for Llama-3.2-3B on instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE dataset.
Cold-Start Personalization via Training-Free Priors from Structured World Models
Avinandan Bose, Shuyue Stella Li, Faeze Brahman, Pang Wei Koh, Simon Shaolei Du · Feb 16, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available.
Unlocking Reasoning Capability on Machine Translation in Large Language Models
Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio · Feb 16, 2026 · Citations: 0

Critique Edit Automatic Metrics Long Horizon

We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang · Feb 12, 2026 · Citations: 0

Expert Verification Automatic Metrics

On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distil
Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models
Mingyu Cao, Alvaro H. C. Correia, Christos Louizos, Shiwei Liu, Lu Yin · Feb 11, 2026 · Citations: 0

Automatic Metrics

Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026 · Citations: 0

Pairwise Preference Simulation Env Tool Use

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
Accelerating Scientific Research with Gemini: Case Studies and Common Techniques
David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo · Feb 3, 2026 · Citations: 0

Automatic Metrics

Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer.
Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs
Xiang Zheng, Weiqi Zhai, Wei Wang, Boyu Yang, Wenbo Li · Jan 31, 2026 · Citations: 0

Automatic Metrics Multi Agent

Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about their ability to diagnose genuine reasoning competence.
Orthogonalized Policy Optimization:Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0

Automatic Metrics Long Horizon

Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
Generating metamers of human scene understanding
Ritik Raina, Abe Leite, Alexandros Graikos, Seoyoung Ahn, Dimitris Samaras · Jan 16, 2026 · Citations: 0

Automatic Metrics

Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene.
CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving
Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng · Jan 5, 2026 · Citations: 0

Automatic Metrics

Motivated by this, we present CogFlow, a novel cognitive-inspired three-stage framework that incorporates a knowledge internalization stage, explicitly simulating the hierarchical flow of human reasoning: perception$\Rightarrow$internalizat
Group Representational Position Encoding
Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan · Dec 8, 2025 · Citations: 0

Automatic Metrics

We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions.
Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale
David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu · Nov 7, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

We introduce a framework able to synthesize vision-centric problems spanning diverse levels of complexity, and the resulting dataset with over 1M high-quality problems including: reasoning traces, preference data, and instruction prompts su
A Proof of Learning Rate Transfer under $μ$P
Soufiane Hayou · Nov 3, 2025 · Citations: 0

Automatic Metrics

We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with $μ$P, a neural network parameterization designed to ``maximize'' feature learning in the infinite-width limit.
Slm-mux: Orchestrating small language models for reasoning
Chenyu Wang, Zishen Wan, Hao Kang, Emma Chen, Zhiqiang Xie · Oct 6, 2025 · Citations: 0

Automatic Metrics

Additional experiments show that the core principle of SLM-MUX extends to open-ended generation tasks (e.g., HumanEval) and benefits other model classes, including frontier LLMs and domain-specific fine-tuned SLMs.
From Parameters to Behaviors: Unsupervised Compression of the Policy Space
Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli · Sep 26, 2025 · Citations: 0

Simulation Env

Despite its recent successes, Deep Reinforcement Learning (DRL) is notoriously sample-inefficient.
NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems
Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li · Aug 28, 2025 · Citations: 0

Automatic Metrics

However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored.
Diffusion Language Models Know the Answer Before Decoding
Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan · Aug 27, 2025 · Citations: 0

Automatic Metrics

Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality.
Hidden Dynamics of Massive Activations in Transformer Training
Jorge Gallego-Feliciano, S. Aaron McClendon, Juan Morinelli, Stavros Zervoudakis, Antonios Saravanos · Aug 5, 2025 · Citations: 0

Automatic Metrics

We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed, and release our full dataset publicly to support further research.
$\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts
Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer · Jun 15, 2025 · Citations: 0

Automatic Metrics

Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration.
Spurious Rewards: Rethinking Training Signals in RLVR
Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang · Jun 12, 2025 · Citations: 0

Automatic Metrics

We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer.
AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
Silin Gao, Antoine Bosselut, Samy Bengio, Emmanuel Abbe · Jun 9, 2025 · Citations: 0

Automatic Metrics

Our method, AbstRaL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks.
Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models
Cheonbok Park, Jeonghoon Kim, Joosung Lee, Sanghwan Bae, Jaegul Choo · Jun 6, 2025 · Citations: 0

Automatic Metrics

Reinforcement learning with verifiable reward (RLVR) has been instrumental in eliciting strong reasoning capabilities from large language models (LLMs) via long chains of thought (CoT).
"Don't Do That!": Guiding Embodied Systems through Large Language Model-based Constraint Generation
Amin Seffo, Aladin Djuhera, Masataro Asai, Holger Boche · Jun 4, 2025 · Citations: 0

Simulation Env Web Browsing

Recent advancements in large language models (LLMs) have spurred interest in robotic navigation that incorporates complex spatial, mathematical, and conditional constraints from natural language into the planning problem.
On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu · May 23, 2025 · Citations: 0

Automatic Metrics

On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO.
Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach
Oren Sultan, Eitan Stern, Dafna Shahaf · May 20, 2025 · Citations: 0

Automatic Metrics

Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation.
BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs
Junxiao Yang, Jinzhe Tu, Haoran Liu, Xiaoce Wang, Chujie Zheng · May 18, 2025 · Citations: 0

Automatic Metrics

Recent advances in Large Reasoning Models (LRMs) have shown impressive capabilities in mathematical and logical reasoning.
Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao · Apr 7, 2025 · Citations: 0

Red Team Automatic Metrics

We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies (in-cont

CS.AI + Math Papers

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Top Papers

Related Hubs