Benchmark Hub

MATH Benchmark Papers (Last 300 Days)

Updated from current HFEPX corpus (Feb 27, 2026). 19 papers are grouped in this benchmark page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequently cited benchmark: MATH. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 24, 2026.

Papers: 19 Last published: Feb 24, 2026 Global RSS

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 19 papers for MATH Benchmark Papers (Last 300 Days). Dominant protocol signals include automatic metrics, simulation environments, with frequent benchmark focus on MATH, MATH-500 and metric focus on accuracy, latency. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

10.5% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters , Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale , Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning , Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
automatic metrics appears in 94.7% of papers in this hub.

Evidence: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning , Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards , RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models , Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset
MATH is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning , Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards , RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models , Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning , Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards , RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models , Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset
Rater context is mostly domain experts, and annotation is commonly ranking annotation; use this to scope replication staffing.

Evidence: Cognitive networks reconstruct mindsets about STEM subjects and educational contexts in almost 1000 high-schoolers, University students and LLM-based digital twins , Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters , Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning , Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Stratify by benchmark (MATH vs MATH-500) before comparing methods.

Evidence: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning , Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards , RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models , Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset

Benchmark Interpretation

MATH appears in 100% of hub papers (19/19); use this cohort for benchmark-matched comparisons.
MATH-500 appears in 10.5% of hub papers (2/19); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 52.6% of hub papers (10/19); compare with a secondary metric before ranking methods.
latency is reported in 10.5% of hub papers (2/19); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (10.5% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (100% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (68.4% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (10.5% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (10.5% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (10.5% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).

Papers naming benchmarks/datasets

Coverage is strong (100% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (68.4% vs 35% target).

Papers with known rater population

Coverage is a replication risk (10.5% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (10.5% vs 35% target).

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10.5% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: MATH - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=0, left_only=18, right_only=1

0 papers use both Automatic Metrics and Simulation Env.

Benchmark Brief

MATH

Coverage: 19 papers (100%)

19 papers (100%) mention MATH.

Examples: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning , Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards , RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

Benchmark Brief

MATH-500

Coverage: 2 papers (10.5%)

2 papers (10.5%) mention MATH-500.

Examples: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning , Spurious Rewards: Rethinking Training Signals in RLVR

Benchmark Brief

MMLU

Coverage: 2 papers (10.5%)

2 papers (10.5%) mention MMLU.

Examples: Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale , Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Metric Brief

accuracy

Coverage: 10 papers (52.6%)

10 papers (52.6%) mention accuracy.

Metric Brief

latency

Coverage: 2 papers (10.5%)

2 papers (10.5%) mention latency.

Examples: Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters , CDLM: Consistency Diffusion Language Models For Faster Sampling

Metric Brief

pass@1

Coverage: 2 papers (10.5%)

2 papers (10.5%) mention pass@1.

Examples: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning , Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning , Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards , RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers On This Benchmark

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang · Feb 24, 2026

Automatic Metrics

Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026

Automatic Metrics

Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
Yunseok Han, Yejoon Lee, Jaeyoung Do · Feb 19, 2026

Automatic Metrics

To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions.
Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset
Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham, Jinsook Lee, Doug Pietrzak · Feb 18, 2026

Automatic Metrics

To address this challenge, we investigate the "numeric ambiguity" problem and introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, created through a human-in-the-loop LLM workflow that audits upstr
From Growing to Looping: A Unified View of Iterative Computation in LLMs
Ferdinand Kapl, Emmanouil Angelis, Kaitlin Maile, Johannes von Oswald, Stefan Bauer · Feb 18, 2026

Automatic Metrics

Looping, reusing a block of layers across depth, and depth growing, training shallow-to-deep models by duplicating middle layers, have both been linked to stronger reasoning, but their relationship remains unclear.
Learning to Learn from Language Feedback with Social Meta-Learning
Jonathan Cook, Diego Antognini, Martin Klissarov, Claudiu Musat, Edward Grefenstette · Feb 18, 2026

Automatic Metrics

They are rarely proactive in soliciting this feedback, even when faced with ambiguity, which can make their dialogues feel static, one-sided, and lacking the adaptive qualities of human conversation.
Recursive Concept Evolution for Compositional Reasoning in Large Language Models
Sarim Chaudhry · Feb 17, 2026

Automatic Metrics

Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE.
Prescriptive Scaling Reveals the Evolution of Language Model Capabilities
Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade · Feb 17, 2026

Automatic Metrics

Using large scale observational evaluations with 5k observational and 2k newly sampled data on model performance, we estimate capability boundaries, high conditional quantiles of benchmark scores as a function of log pre training FLOPs, via
Cognitive networks reconstruct mindsets about STEM subjects and educational contexts in almost 1000 high-schoolers, University students and LLM-based digital twins
Francesco Gariboldi, Emma Franchino, Edith Haim, Gianluca Lattanzi, Alessandro Grecucci · Feb 16, 2026

Automatic Metrics

Human networks show greater overlapping between mathematics and anxiety than GPT-oss.
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026

Simulation Env Tool Use

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
LLMs Know More About Numbers than They Can Say
Fengting Yuchi, Li Du, Jason Eisner · Feb 8, 2026

Automatic Metrics

Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: "Which is larger, $5.7 \times 10^2$ or $580$?" This raises a fundamental question: Do LLMs even know how big
Proof-RM: A Scalable and Generalizable Reward Model for Math Proof
Haotong Yang, Zitong Wang, Shijia Kang, Siqi Yang, Wenkai Yu · Feb 2, 2026

Automatic Metrics

In this work, we design a *scalable* data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality ``**question-proof-check**'' triplet data.
Orthogonalized Policy Optimization:Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026

Automatic Metrics Long Horizon

Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
CDLM: Consistency Diffusion Language Models For Faster Sampling
Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun · Nov 24, 2025

Automatic Metrics

The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale
David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu · Nov 7, 2025

Automatic Metrics

We introduce a framework able to synthesize vision-centric problems spanning diverse levels of complexity, and the resulting dataset with over 1M high-quality problems including: reasoning traces, preference data, and instruction prompts su
Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti · Sep 18, 2025

Automatic Metrics

Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges.
Spurious Rewards: Rethinking Training Signals in RLVR
Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang · Jun 12, 2025

Automatic Metrics

We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer.
AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
Silin Gao, Antoine Bosselut, Samy Bengio, Emmanuel Abbe · Jun 9, 2025

Automatic Metrics

Our method, AbstRaL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks.
Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models
Cheonbok Park, Jeonghoon Kim, Joosung Lee, Sanghwan Bae, Jaegul Choo · Jun 6, 2025

Automatic Metrics

Reinforcement learning with verifiable reward (RLVR) has been instrumental in eliciting strong reasoning capabilities from large language models (LLMs) via long chains of thought (CoT).

Other Benchmark Hubs

MATH Benchmark Papers (Last 300 Days)

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Top Papers On This Benchmark

Other Benchmark Hubs