
HFEPX Hub

Long Horizon + Math (Last 60 Days)

Updated from the current HFEPX corpus (Apr 12, 2026). 19 papers are grouped on this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: GSM8K. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Mar 19, 2026.

Papers: 19 · Last published: Mar 19, 2026
Tags: Long Horizon · Math · Last 60d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Medium.

High-Signal Coverage

100.0%

19 / 19 sampled papers are free of low-signal flags.

Replication-Ready Set

6

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

0

Papers containing both `human_eval` and `llm_as_judge`.

  • 6 papers are replication-ready (benchmark + metric + explicit evaluation mode); see the filter sketch below.
  • 0 papers support judge-vs-human agreement analysis.
  • 1 paper reports explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.
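
The replication-ready count above is a simple presence check on three fields. A minimal sketch of that filter in Python, assuming a hypothetical list of per-paper metadata dicts; the field names `benchmarks`, `metrics`, and `eval_modes` are illustrative, not the hub's actual schema:

```python
# Hypothetical abstract-level metadata for two papers (illustrative values only).
papers = [
    {"title": "Paper A", "benchmarks": ["GSM8K"], "metrics": ["accuracy"],
     "eval_modes": ["automatic_metrics"]},
    {"title": "Paper B", "benchmarks": [], "metrics": ["accuracy"],
     "eval_modes": []},
]

def is_replication_ready(paper):
    # Benchmark + metric + explicit evaluation mode must all be present.
    return bool(paper["benchmarks"]) and bool(paper["metrics"]) and bool(paper["eval_modes"])

ready = [p["title"] for p in papers if is_replication_ready(p)]
print(f"{len(ready)} / {len(papers)} replication-ready:", ready)
```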

Why This Matters For Eval Research

  • 10.5% of papers report explicit human-feedback signals, led by critique/edit feedback.
  • The automatic-metrics evaluation mode appears in 84.2% of papers in this hub.
  • GSM8K is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

  • The most common quality-control signal is rater calibration (reported in 5.3% of papers).
  • Rater pools are mostly unspecified, and annotation is most often at the trajectory level; use this to scope replication staffing.
  • Stratify by benchmark (GSM8K vs BankMathBench) before comparing methods; a minimal stratification sketch follows this list.
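
A minimal stratification sketch, assuming a hypothetical `results` table with one row per (method, benchmark) score; the method names and numbers are illustrative only:

```python
import pandas as pd

# Hypothetical per-benchmark scores for two methods.
results = pd.DataFrame({
    "method":    ["A", "B", "A", "B"],
    "benchmark": ["GSM8K", "GSM8K", "BankMathBench", "BankMathBench"],
    "accuracy":  [0.81, 0.78, 0.64, 0.69],
})

# Compare methods only within the same benchmark stratum; never pool
# GSM8K and BankMathBench rows into a single ranking.
for benchmark, stratum in results.groupby("benchmark"):
    ranking = stratum.sort_values("accuracy", ascending=False)
    print(benchmark, ranking[["method", "accuracy"]].to_dict("records"))
```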

Benchmark Interpretation

  • GSM8K appears in 26.3% of hub papers (5/19); use this cohort for benchmark-matched comparisons.
  • BankMathBench appears in 5.3% of hub papers (1/19); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 63.2% of hub papers (12/19); compare with a secondary metric before ranking methods.
  • cost is reported in 31.6% of hub papers (6/19); compare with a secondary metric before ranking methods (a rank-agreement sketch follows this list).
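
A minimal rank-agreement sketch for that comparison, assuming hypothetical per-method accuracy and cost scores; Kendall's tau is used here only as one convenient way to quantify how much the ranking shifts under the secondary metric:

```python
from scipy.stats import kendalltau

# Hypothetical method-level scores (illustrative values only).
methods  = ["A", "B", "C", "D"]
accuracy = [0.82, 0.79, 0.75, 0.70]   # higher is better
cost     = [1.40, 0.60, 0.55, 0.50]   # lower is better

# Order of method indices under each metric.
rank_by_accuracy = sorted(range(len(methods)), key=lambda i: -accuracy[i])
rank_by_cost     = sorted(range(len(methods)), key=lambda i: cost[i])

# Kendall's tau between the two orderings; low or negative agreement means the
# headline ranking is metric-sensitive and should not be reported alone.
tau, _ = kendalltau(
    [rank_by_accuracy.index(i) for i in range(len(methods))],
    [rank_by_cost.index(i) for i in range(len(methods))],
)
print("rank by accuracy:", [methods[i] for i in rank_by_accuracy])
print("rank by cost:    ", [methods[i] for i in rank_by_cost])
print(f"Kendall tau: {tau:.2f}")
```
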
Researcher Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (10.5% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (5.3% vs 30% target).

  • Strong: Papers naming benchmarks/datasets

    Coverage is strong (36.8% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (84.2% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (0% vs 35% target).

  • Strong: Papers with known annotation unit

    Coverage is strong (57.9% vs 35% target). A minimal coverage-check sketch follows this checklist.
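
A minimal coverage-check sketch mirroring the checklist: the dimension names, observed rates, and target thresholds below are copied from the items above, and the Gap/Strong labels simply re-derive that split:

```python
# Observed coverage and targets, copied from the checklist above.
targets = {
    "human_feedback":   0.45,
    "quality_controls": 0.30,
    "benchmarks":       0.35,
    "metrics":          0.35,
    "rater_population": 0.35,
    "annotation_unit":  0.35,
}
observed = {
    "human_feedback":   0.105,
    "quality_controls": 0.053,
    "benchmarks":       0.368,
    "metrics":          0.842,
    "rater_population": 0.000,
    "annotation_unit":  0.579,
}

for dim, target in targets.items():
    status = "Strong" if observed[dim] >= target else "Gap"
    print(f"{status:6s} {dim}: {observed[dim]:.1%} vs {target:.0%} target")
```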

Strengths

  • Measurable evaluation context is common (benchmarks named in 36.8% of papers, metrics in 84.2%).
  • Agentic evaluation appears in 100% of papers.

Known Gaps

  • Only 5.3% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (0% coverage).

Suggested Next Analyses

  • Stratify by benchmark (GSM8K vs BankMathBench) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and cost.
  • Add inter-annotator agreement checks when reproducing these protocols (a minimal agreement sketch follows this list).
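
For the last point, a minimal agreement sketch assuming two hypothetical raters labelling the same trajectories; scikit-learn's `cohen_kappa_score` is one readily available implementation:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two raters over the same five trajectories.
rater_a = ["correct", "correct", "incorrect", "correct", "incorrect"]
rater_b = ["correct", "incorrect", "incorrect", "correct", "incorrect"]

# Cohen's kappa corrects raw agreement for chance; values near 0 suggest the
# labelling guidelines need calibration before scaling up annotation.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```
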
Recommended Queries

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).
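
The ranking criterion can be read as a simple protocol-completeness score. A minimal sketch, assuming hypothetical boolean flags per paper; the equal weighting is illustrative, not the hub's actual scoring:

```python
# Hypothetical presence flags for the ingredients named above.
papers = {
    "Paper A": {"human_signal": False, "benchmark": True, "metric": True,
                "quality_controls": True, "judge_human_overlap": False},
    "Paper B": {"human_signal": True, "benchmark": True, "metric": False,
                "quality_controls": False, "judge_human_overlap": False},
}

def completeness(flags):
    # One point per reported protocol ingredient; swap in weights as needed.
    return sum(flags.values())

for name in sorted(papers, key=lambda n: completeness(papers[n]), reverse=True):
    print(name, completeness(papers[name]))
```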

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Columns: Paper (date) · HF Signal · Eval Modes · Benchmarks · Metrics · QC

  • Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought (Mar 19, 2026) · HF: No · Eval: Automatic Metrics · Benchmarks: GSM8K · Metrics: Accuracy, Calibration error · QC: Calibration
  • RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models (Mar 27, 2026) · HF: Yes · Eval: Not Reported · Benchmarks: GSM8K · Metrics: Not Reported · QC: Not Reported
  • Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents (Apr 9, 2026) · HF: No · Eval: Automatic Metrics · Benchmarks: GSM8K · Metrics: Accuracy · QC: Not Reported
  • S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models (Apr 1, 2026) · HF: No · Eval: Automatic Metrics · Benchmarks: MATH-500, GSM8K · Metrics: Pass@1, Cost · QC: Not Reported
  • Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes (Mar 15, 2026) · HF: No · Eval: Automatic Metrics · Benchmarks: GSM8K, GPQA · Metrics: Accuracy · QC: Not Reported
  • Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning (Mar 9, 2026) · HF: No · Eval: Automatic Metrics · Benchmarks: MMLU · Metrics: Accuracy, Cost · QC: Not Reported
  • BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios (Feb 19, 2026) · HF: No · Eval: Automatic Metrics · Benchmarks: BankMathBench · Metrics: Accuracy · QC: Not Reported
  • Unlocking Reasoning Capability on Machine Translation in Large Language Models (Feb 16, 2026) · HF: Yes · Eval: Not Reported · Benchmarks: Not Reported · Metrics: Not Reported · QC: Not Reported
  • SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning (Apr 8, 2026) · HF: No · Eval: Automatic Metrics · Benchmarks: Not Reported · Metrics: Accuracy · QC: Not Reported
  • TRIMS: Trajectory-Ranked Instruction Masked Supervision for Diffusion Language Models (Apr 1, 2026) · HF: No · Eval: Automatic Metrics · Benchmarks: Not Reported · Metrics: Accuracy, Latency · QC: Not Reported
  • Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning (Apr 8, 2026) · HF: No · Eval: Automatic Metrics · Benchmarks: Not Reported · Metrics: Accuracy · QC: Not Reported
  • Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency (Mar 31, 2026) · HF: No · Eval: Automatic Metrics · Benchmarks: Not Reported · Metrics: Accuracy, Coherence · QC: Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal | Entropy trajectory shape predicts LLM reasoning rel… | RASPRef: Retrieval-Augmented Self-Supervised Prompt… | Don't Overthink It: Inter-Rollout Action Agreement…
Human Feedback | Not reported | Critique Edit | Not reported
Evaluation Modes | Automatic Metrics | Not reported | Automatic Metrics
Benchmarks | GSM8K | GSM8K | GSM8K
Metrics | Accuracy, Calibration error | Not reported | Accuracy
Quality Controls | Calibration | Not reported | Not reported
Rater Population | Unknown | Unknown | Unknown
Annotation Unit | Scalar | Trajectory | Trajectory
Suggested Reading Order

This section is intentionally expanded only when needed; use “Start Here” above for a faster pass.

  1. Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents

    Start here for detailed protocol reporting and quality-control evidence. Signals: automatic metrics. Focus: GSM8K / accuracy. Abstract: Inference-time compute scaling has emerged as a powerful technique for improving…

  2. Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning

    Start here for detailed protocol reporting and quality-control evidence. Signals: automatic metrics. Focus: accuracy. Abstract: Multi-step Chain-of-Thought (CoT) has significantly advanced the mathematical reasoning capabilities of LLMs by…

  3. SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning

    Start here for detailed protocol reporting and quality-control evidence. Signals: automatic metrics. Focus: accuracy. Abstract: Process supervision has emerged as a promising approach for enhancing LLM reasoning, yet…

  4. RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models

    Include a human-eval paper to calibrate against judge-based evaluation settings. Signals: critique/edit feedback. Focus: GSM8K. Abstract: …To address this limitation, we introduce Retrieval-Augmented Self-Supervised Prompt Refinement (RASPRef), a…

  5. Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

    Include a human-eval paper to calibrate against judge-based evaluation settings. Signals: automatic metrics. Focus: GSM8K / accuracy. Abstract: Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply…

  6. S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

    Adds automatic metrics for broader protocol coverage within this hub. Signals: automatic metrics. Focus: MATH-500 / pass@1. Abstract: Using roughly 48 execution-verified HumanEval training solutions, tuning a single…

  7. Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes

    Adds automatic metrics for broader protocol coverage within this hub. Signals: automatic metrics. Focus: GSM8K / accuracy. Abstract: Probabilistic language generators are theoretically modeled as discrete stochastic processes…

  8. Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning

    Adds automatic metrics for broader protocol coverage within this hub. Signals: automatic metrics. Focus: MMLU / accuracy. Abstract: Large language models (LLMs) achieve strong reasoning performance through chain-of-thought…

Known Limitations

  • Only 5.3% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (0% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
Research Utility Snapshot

Human Feedback Mix

  • Critique/Edit (2)

Evaluation Modes

  • Automatic Metrics (16)
  • Simulation Env (1)

Top Benchmarks

  • GSM8K (5)
  • BankMathBench (1)
  • GPQA (1)
  • HumanEval+ (1)

Top Metrics

  • Accuracy (12)
  • Cost (6)
  • Agreement (2)
  • Coherence (2)

Rater Population Mix

  • Not reported (0% coverage)

Quality Controls

  • Calibration (1)

Coverage diagnostics (sample-based): human feedback 10.5% · benchmarks 36.8% · metrics 84.2% · quality controls 5.3%.
