
HFEPX Benchmark Hub

AIME Benchmark Papers

Updated from the current HFEPX corpus (Mar 8, 2026). This page groups 6 papers.


Common evaluation modes: Automatic Metrics, Human Eval. Frequently cited benchmark: AIME. Common metric signal: pass@1. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 21, 2026.

Papers: 6 · Last published: Feb 21, 2026

Researcher Quick Triage

Use this page for benchmark-matched method comparisons and eval protocol selection. Quality band: Developing.

High-Signal Coverage

100.0%

6 / 6 sampled papers are not flagged as low-signal.

Replication-Ready Set

1

Papers with explicit benchmark + metric + eval mode fields.

Quality Controls

0.0%

0 papers report calibration/adjudication/IAA controls.

  • 2 papers explicitly name benchmark datasets in the sampled set.
  • 1 paper reports at least one metric term in metadata extraction.
  • Start with the ranked shortlist below before reading all papers.

Primary action: Use this page to map benchmark mentions first; wait for stronger metric/QC coverage before running strict comparisons.


Why This Matters For Eval Research

  • 33.3% of papers report explicit human-feedback signals, led by critique/edit feedback.
  • Automatic metrics appear in 16.7% of papers in this hub.
  • AIME is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
  • Stratify by benchmark (AIME vs CodeContests) before comparing methods.
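The stratification step above can be sketched in plain Python. This is a minimal sketch: the record fields (`title`, `benchmarks`) are illustrative assumptions, not the hub's actual metadata schema.

```python
from collections import defaultdict

def stratify_by_benchmark(papers):
    """Group paper records by benchmark so methods are only
    compared within the same evaluation setting."""
    strata = defaultdict(list)
    for paper in papers:
        for benchmark in paper.get("benchmarks", []):
            strata[benchmark].append(paper["title"])
    return dict(strata)

# Toy records mirroring two papers from this hub (benchmark lists assumed):
papers = [
    {"title": "Critique-GRPO", "benchmarks": ["AIME"]},
    {"title": "MobileLLM-R1", "benchmarks": ["AIME", "CodeContests"]},
]
strata = stratify_by_benchmark(papers)
# strata["AIME"] holds both papers; strata["CodeContests"] holds one.
```

Comparing methods only within a stratum avoids mixing AIME math results with CodeContests coding results in a single ranking.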

Benchmark Interpretation

  • AIME appears in 100% of hub papers (6/6); use this cohort for benchmark-matched comparisons.
  • CodeContests appears in 16.7% of hub papers (1/6); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • pass@1 is reported in 33.3% of hub papers (2/6); compare with a secondary metric before ranking methods.
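For context on the metric itself: pass@1 is the probability that a single sampled answer solves the problem, and the standard generalization is the unbiased pass@k estimator computed from n samples with c correct. A minimal sketch (not tied to any specific paper's implementation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n attempts (c of them
    correct) solves the problem."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k=1 this reduces to the per-problem fraction of correct samples:
result = pass_at_k(10, 3, 1)
assert abs(result - 0.3) < 1e-9
```

Averaging this estimate over all problems gives the benchmark-level pass@k figure that papers report.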

Start Here (Benchmark-Matched First 6)

Ranked by protocol completeness so you can quickly find papers suitable for comparison studies.
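"Protocol completeness" can be approximated by counting how many protocol fields a paper reports. A minimal sketch, assuming the field names below (the hub's real scoring may differ):

```python
PROTOCOL_FIELDS = ("eval_modes", "human_feedback", "metrics", "quality_controls")

def protocol_completeness(paper: dict) -> int:
    """Count reported protocol fields; missing or 'Not reported'
    fields do not count."""
    return sum(
        1 for field in PROTOCOL_FIELDS
        if paper.get(field) and paper[field] != "Not reported"
    )

# Toy records based on the matrix below (field names assumed):
papers = [
    {"title": "Critique-GRPO", "eval_modes": "Automatic Metrics",
     "human_feedback": "Critique Edit", "metrics": "Pass@1",
     "quality_controls": "Not reported"},
    {"title": "$V_1$", "eval_modes": "Not reported"},
]
ranked = sorted(papers, key=protocol_completeness, reverse=True)
# Papers with more reported fields sort to the top of the shortlist.
```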

Protocol Matrix

Compare protocol ingredients quickly before deep-reading full papers.

Paper | Date | Eval Modes | Human Feedback | Metrics | Quality Controls
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback | Jun 3, 2025 | Automatic Metrics | Critique Edit | Pass@1 | Not reported
Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models | Feb 21, 2026 | Human Eval | Pairwise Preference | Not reported | Not reported
$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners | Mar 4, 2026 | Not reported | Not reported | Not reported | Not reported
Tool Verification for Test-Time Reinforcement Learning | Mar 2, 2026 | Not reported | Not reported | Not reported | Not reported
CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning | Mar 1, 2026 | Not reported | Not reported | Not reported | Not reported
MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes | Sep 29, 2025 | Not reported | Not reported | Not reported | Not reported
Researcher Workflow (Detailed)

Checklist

  • Moderate: Papers with explicit human feedback

    Coverage is usable but incomplete (33.3% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (0% vs 30% target).

  • Strong: Papers naming benchmarks/datasets

    Coverage is strong (100% vs 35% target).

  • Moderate: Papers naming evaluation metrics

    Coverage is usable but incomplete (33.3% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (0% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (0% vs 35% target).
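The Strong/Moderate/Gap labels above compare observed coverage against a per-field target. One plausible banding rule consistent with the figures shown (the hub's exact thresholds are not published, so this is an assumption):

```python
def quality_band(coverage_pct: float, target_pct: float) -> str:
    """Band a coverage percentage against its target: Strong at or
    above target, Gap at zero coverage, Moderate in between."""
    if coverage_pct >= target_pct:
        return "Strong"
    if coverage_pct == 0:
        return "Gap"
    return "Moderate"

# Reproduces the checklist labels above:
assert quality_band(100.0, 35) == "Strong"   # benchmarks/datasets
assert quality_band(33.3, 45) == "Moderate"  # human feedback
assert quality_band(0.0, 30) == "Gap"        # quality controls
```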

Strengths

  • This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • No papers (0%) report quality controls; prioritize papers with explicit calibration or adjudication evidence.
  • Rater population is under-specified (0% coverage).
  • Annotation unit is under-specified (0% coverage).



Known Limitations
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (1)
  • Human Eval (1)

Human Feedback Mix

  • Critique Edit (1)
  • Pairwise Preference (1)

Top Benchmarks

  • AIME (6)
  • CodeContests (1)
  • CorrectBench (1)
  • CRUXEval (1)

Top Metrics

  • Pass@1 (2)
