HFEPX Benchmark Hub

Reasoning & Math Suite Benchmark Papers + Math

Updated from current HFEPX corpus (Mar 21, 2026). 10 papers are grouped in this benchmark page.

Common evaluation modes: Automatic Metrics, Human Eval. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: GSM8K. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 4, 2026.

Papers: 10 · Last published: Mar 4, 2026

Researcher Quick Triage

Use this page for benchmark-matched method comparisons and eval protocol selection. Quality band: Developing.

High-Signal Coverage

100.0%

10 / 10 sampled papers are not low-signal flagged.

Replication-Ready Set

Papers with explicit benchmark + metric + eval mode fields.

Quality Controls

10.0%

1 paper reports calibration/adjudication/IAA controls.

10 papers explicitly name benchmark datasets in the sampled set.
8 papers report at least one metric term in metadata extraction.
Start with the ranked shortlist below before reading all papers.

Primary action: Start with the top 2 benchmark-matched papers, then compare evaluation modes in the protocol matrix.

Why This Matters (Expanded)

Why This Matters For Eval Research

40% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 80% of papers in this hub.
GSM8K is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Notes (Expanded)

Protocol Takeaways

Most common quality-control signal is rater calibration (10% of papers).
Rater pools are mostly unspecified, and annotation is commonly trajectory-level; use this to scope replication staffing.
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Benchmark Interpretation

GSM8K appears in 70% of hub papers (7/10); use this cohort for benchmark-matched comparisons.
AIME appears in 20% of hub papers (2/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 60% of hub papers (6/10); compare with a secondary metric before ranking methods.
Cost is reported in 30% of hub papers (3/10); compare with a secondary metric before ranking methods.
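
One way to compare accuracy against a secondary metric such as cost is to keep only the methods that no other method beats on both axes (a Pareto front). The sketch below is illustrative: the `Result` type, the `pareto_front` helper, and all numbers are hypothetical and are not taken from any paper in this hub.

```python
from typing import NamedTuple


class Result(NamedTuple):
    method: str
    accuracy: float  # higher is better
    cost: float      # lower is better (for example, relative inference cost)


def pareto_front(results: list[Result]) -> list[Result]:
    """Return the results that are not dominated on (accuracy, cost)."""
    front = []
    for r in results:
        # r is dominated if another result is at least as good on both
        # axes and strictly better on at least one of them.
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in results
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r.cost)


# Hypothetical numbers for illustration only.
results = [
    Result("method_a", accuracy=0.82, cost=1.0),
    Result("method_b", accuracy=0.85, cost=4.0),
    Result("method_c", accuracy=0.80, cost=2.0),
]

for r in pareto_front(results):
    print(f"{r.method}: accuracy={r.accuracy:.2f}, cost={r.cost:.1f}")
```

Here `method_c` drops out because `method_a` is both more accurate and cheaper, while `method_a` and `method_b` remain as a genuine accuracy-versus-cost tradeoff that a single-metric ranking would hide.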

Start Here (Benchmark-Matched First 6)

Ranked by protocol completeness so you can quickly find papers suitable for comparison studies.

$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
Mar 4, 2026 · Citations: 0 · Score: 8.5
Eval: Automatic Metrics · Metrics: Pass@1

Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought
Mar 19, 2026 · Citations: 0 · Score: 8.5
Eval: Automatic Metrics · Metrics: Accuracy

FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol
Oct 2, 2025 · Citations: 0 · Score: 7.5
Eval: Automatic Metrics · Metrics: Accuracy

Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes
Mar 15, 2026 · Citations: 0 · Score: 7.0
Eval: Automatic Metrics · Metrics: Accuracy

Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
Mar 9, 2026 · Citations: 0 · Score: 7.0
Eval: Automatic Metrics · Metrics: Accuracy

Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Feb 21, 2026 · Citations: 0 · Score: 6.5
Eval: Human Eval · Metrics: Not Reported

Protocol Matrix (Top 10)

Compare protocol ingredients quickly before deep-reading full papers.

Paper	Date	Eval Modes	Human Feedback	Metrics	Quality Controls
$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners	Mar 4, 2026	Automatic Metrics	Pairwise Preference	Pass@1	Not reported
Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought	Mar 19, 2026	Automatic Metrics	Not reported	Accuracy, Calibration error	Calibration
FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol	Oct 2, 2025	Automatic Metrics	Pairwise Preference, Critique Edit	Accuracy	Not reported
Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes	Mar 15, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning	Mar 9, 2026	Automatic Metrics	Not reported	Accuracy, Cost	Not reported
Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models	Feb 21, 2026	Human Eval	Pairwise Preference	Not reported	Not reported
The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models	Jan 21, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs	Dec 3, 2025	Automatic Metrics	Not reported	Cost	Not reported
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling	Jun 18, 2025	Automatic Metrics	Not reported	Accuracy, Precision	Not reported
Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale	Nov 7, 2025	Not reported	Pairwise Preference	Not reported	Not reported
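
The coverage percentages quoted on this page can be rechecked from the matrix rows. The sketch below transcribes the ten rows by hand, with shortened titles and "Not reported" stored as an empty list; the `Row` type and the `coverage` and `share` helpers are illustrative names, not part of any HFEPX tooling.

```python
from typing import NamedTuple


class Row(NamedTuple):
    paper: str
    eval_modes: list
    human_feedback: list
    metrics: list
    quality_controls: list


# Transcribed from the protocol matrix; "Not reported" becomes [].
MATRIX = [
    Row("V_1", ["Automatic Metrics"], ["Pairwise Preference"], ["Pass@1"], []),
    Row("Entropy trajectory", ["Automatic Metrics"], [],
        ["Accuracy", "Calibration error"], ["Calibration"]),
    Row("FOR-Prompting", ["Automatic Metrics"],
        ["Pairwise Preference", "Critique Edit"], ["Accuracy"], []),
    Row("Top-b", ["Automatic Metrics"], [], ["Accuracy"], []),
    Row("Learning When to Sample", ["Automatic Metrics"], [],
        ["Accuracy", "Cost"], []),
    Row("Think^2", ["Human Eval"], ["Pairwise Preference"], [], []),
    Row("The Flexibility Trap", ["Automatic Metrics"], [], ["Accuracy"], []),
    Row("Cache What Lasts", ["Automatic Metrics"], [], ["Cost"], []),
    Row("SPARE", ["Automatic Metrics"], [], ["Accuracy", "Precision"], []),
    Row("Long Grounded Thoughts", [], ["Pairwise Preference"], [], []),
]


def coverage(rows, field):
    """Percentage of rows that report at least one value in `field`."""
    reported = sum(1 for row in rows if getattr(row, field))
    return 100.0 * reported / len(rows)


def share(rows, field, value):
    """Percentage of rows whose `field` contains the given value."""
    hits = sum(1 for row in rows if value in getattr(row, field))
    return 100.0 * hits / len(rows)


print("human feedback:", coverage(MATRIX, "human_feedback"))
print("any metric:", coverage(MATRIX, "metrics"))
print("quality controls:", coverage(MATRIX, "quality_controls"))
print("automatic metrics:", share(MATRIX, "eval_modes", "Automatic Metrics"))
```

Only the columns shown in the matrix can be rechecked this way; benchmark counts such as GSM8K (7) come from metadata that the matrix does not list per paper.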

Researcher Workflow (Detailed)

Checklist

Moderate: Papers with explicit human feedback
Coverage is usable but incomplete (40% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (10% vs 30% target).

Strong: Papers naming benchmarks/datasets
Coverage is strong (100% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (80% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (0% vs 35% target).

Strong: Papers with known annotation unit
Coverage is strong (70% vs 35% target).
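
The checklist labels compare each coverage figure with a target. The page does not state the rule behind the Strong, Moderate, and Gap labels, so the sketch below assumes one: Strong at or above target, Moderate at or above half the target, Gap otherwise. The half-target cutoff is a guess that happens to reproduce the six labels above; `coverage_status` is a hypothetical helper.

```python
def coverage_status(coverage_pct: float, target_pct: float) -> str:
    """Label a coverage figure against its target.

    Assumed rule (not confirmed by the page): Strong when the target is
    met, Moderate when at least half of the target is met, Gap otherwise.
    """
    if coverage_pct >= target_pct:
        return "Strong"
    if coverage_pct >= 0.5 * target_pct:
        return "Moderate"
    return "Gap"


# (item, coverage %, target %) as listed in the checklist.
CHECKLIST = [
    ("explicit human feedback", 40, 45),
    ("quality controls", 10, 30),
    ("benchmarks/datasets named", 100, 35),
    ("evaluation metrics named", 80, 35),
    ("known rater population", 0, 35),
    ("known annotation unit", 70, 35),
]

for item, cov, target in CHECKLIST:
    print(f"{coverage_status(cov, target)}: {item} ({cov}% vs {target}% target)")
```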

Strengths

Most papers provide measurable evaluation context (100% benchmarks, 80% metrics).
Agentic evaluation appears in 60% of papers.

Known Gaps

Only 10% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).

Suggested Next Analyses

Stratify by benchmark (GSM8K vs AIME) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.
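
For the inter-annotator agreement check, one common statistic for two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal pure-Python version; the `cohens_kappa` helper and the example labels are hypothetical, and none of the papers in this hub is known to use this exact procedure.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so perfect agreement is returned by convention (an assumption).
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical pairwise-preference labels ("A" or "B" preferred) on four items.
rater_1 = ["A", "A", "B", "B"]
rater_2 = ["A", "B", "B", "B"]
print(cohens_kappa(rater_1, rater_2))
```

In this example the raters agree on 3 of 4 items (0.75) while chance agreement is 0.5, so kappa is 0.5. With more than two raters, or with missing labels, a different statistic such as Krippendorff's alpha would be the usual choice.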

Recommended Queries

Human Eval Protocols
Benchmark Slice: GSM8K
Metric Slice: accuracy
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (8)
Human Eval (1)

Human Feedback Mix

Pairwise Preference (4)
Critique Edit (1)

Top Benchmarks

GSM8K (7)
AIME (2)
MMLU (2)
CodeContests (1)

Top Metrics

Accuracy (6)
Cost (3)
Calibration error (1)
Inference cost (1)

Top Papers On This Benchmark

$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan · Mar 4, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics
On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, V_1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being…

Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought
Xinghao Zhao · Mar 19, 2026 · Citations: 0
Automatic Metrics
Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive.

Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026 · Citations: 0
Pairwise Preference · Human Eval
We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight…

FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol
He Zhang, Anzhou Zhang, Jian Dai · Oct 2, 2025 · Citations: 0
Pairwise Preference · Critique Edit · Automatic Metrics
Beyond structured math tasks, FOR-Prompting supports refinement in open-ended and multi-stage tasks: qualitative analysis shows improved exploration, coverage, and specificity, and a blind study of human preferences found that participants…

Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes
Deepon Halder, Raj Dabre · Mar 15, 2026 · Citations: 0
Automatic Metrics
Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating…

Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
Juming Xiong, Kevin Guo, Congning Ni, Chao Yan, Katherine Brown · Mar 9, 2026 · Citations: 0
Automatic Metrics
Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead.

The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao · Jan 21, 2026 · Citations: 0
Automatic Metrics
We demonstrate that effective reasoning can be better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead.

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying · Dec 3, 2025 · Citations: 0
Automatic Metrics
Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBenchV2 and SCBench), TRIM-KV consistently outperforms…

SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych · Jun 18, 2025 · Citations: 0
Automatic Metrics
To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determine…

Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale
David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu · Nov 7, 2025 · Citations: 0
Pairwise Preference
We introduce a framework able to synthesize vision-centric problems spanning diverse levels of complexity, and the resulting dataset with over 1M high-quality problems including: reasoning traces, preference data, and instruction prompts…

Related Benchmark Hubs

AIME Or GSM8K Or MMLU Benchmark Papers
Reasoning & Math Suite Benchmark Papers
GSM8K + Math Benchmark Papers
GSM8K Or MMLU Benchmark Papers
Math Papers
Reasoning & Math Suite Benchmark Papers + Automatic Metrics
GSM8K Benchmark Papers (Last 300 Days) (10)
GSM8K Benchmark Papers (Last 365 Days) (10)
GSM8K Benchmark Papers (10)
DROP Benchmark Papers (Last 30 Days) (14)
DROP Benchmark Papers (Last 45 Days) (14)
DROP Benchmark Papers (Last 60 Days) (15)
DROP Benchmark Papers (Last 75 Days) (15)
