HFEPX Hub

Math + Long Horizon Papers

Updated from current HFEPX corpus (Feb 27, 2026). 10 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Common annotation unit: Trajectory. Frequent quality control: Inter Annotator Agreement Reported. Frequently cited benchmark: MATH. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 10 · Last published: Feb 26, 2026

Math · Long Horizon

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 10 papers tagged Math + Long Horizon. The dominant protocol signals are automatic metrics and simulation environments; the most frequent benchmarks are MATH and Amo-Bench, and the most frequent metrics are accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

10% of papers report explicit human-feedback signals, led by critique/edit feedback.

Evidence: Unlocking Reasoning Capability on Machine Translation in Large Language Models; Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning; GATES: Self-Distillation under Privileged Context with Consensus Gating

Automatic metrics appear in 90% of papers in this hub.

Evidence: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; GATES: Self-Distillation under Privileged Context with Consensus Gating; Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer; Watermarking LLM Agent Trajectories

MATH is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space; MathScape: Benchmarking Multimodal Large Language Models in Real-World Mathematical Contexts; Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning

Protocol Takeaways

The most common quality-control signal is inter-annotator agreement reporting (10% of papers).

Evidence: GATES: Self-Distillation under Privileged Context with Consensus Gating; Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning; Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer

Rater pools are mostly unspecified, and annotation is commonly at the trajectory level; use this to scope replication staffing.

Evidence: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning; GATES: Self-Distillation under Privileged Context with Consensus Gating; Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer

Stratify by benchmark (MATH vs Amo-Bench) before comparing methods.

Evidence: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning; GATES: Self-Distillation under Privileged Context with Consensus Gating; Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
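
The hub flags inter-annotator agreement on trajectory-level labels as the main quality-control signal, but it does not say which agreement statistic the papers use. As a minimal sketch, assuming two raters and categorical labels, Cohen's kappa is one common choice. The function name, the label values, and the example data below are hypothetical and are not taken from any paper in this hub.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Hypothetical trajectory-level labels from two raters (placeholder data).
rater_1 = ["pass", "pass", "fail", "pass", "fail", "fail"]
rater_2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohen_kappa(rater_1, rater_2)
print(round(kappa, 3))
```

With more than two raters or with ordinal labels, a different statistic would likely be a better fit; this sketch only covers the two-rater categorical case.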

Benchmark Interpretation

MATH appears in 20% of hub papers (2/10); use this cohort for benchmark-matched comparisons.
Amo-Bench appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
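
One way to build benchmark-matched cohorts is to group reported results by benchmark and rank methods only within each group. The sketch below assumes a flat list of (paper, benchmark, score) records; the record layout, the paper identifiers, and the scores are placeholders for illustration, not results from the hub.

```python
from collections import defaultdict

# Hypothetical records: (paper id, benchmark, accuracy). Placeholder values only.
results = [
    ("paper_a", "MATH", 0.50),
    ("paper_b", "MATH", 0.60),
    ("paper_c", "Amo-Bench", 0.30),
]


def stratify_by_benchmark(records):
    """Group records by benchmark; sort each cohort by score, best first."""
    cohorts = defaultdict(list)
    for paper, benchmark, score in records:
        cohorts[benchmark].append((paper, score))
    return {
        benchmark: sorted(rows, key=lambda row: row[1], reverse=True)
        for benchmark, rows in cohorts.items()
    }


cohorts = stratify_by_benchmark(results)
for benchmark, rows in cohorts.items():
    print(benchmark, rows)
```

Scores are never compared across cohorts here, which is the point of stratifying: a MATH number and an Amo-Bench number are not on the same scale.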

Metric Interpretation

accuracy is reported in 30% of hub papers (3/10); compare with a secondary metric before ranking methods.
cost is reported in 30% of hub papers (3/10); compare with a secondary metric before ranking methods.
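
The advice to check a secondary metric before ranking can be made concrete with a Pareto filter over accuracy and cost. This is only one possible reading of that advice, not a procedure the hub prescribes. The method names and numbers are hypothetical, and the sketch assumes higher accuracy and lower cost are both preferred.

```python
def pareto_front(methods):
    """Return names of methods not dominated on (higher accuracy, lower cost)."""
    front = []
    for name, acc, cost in methods:
        dominated = any(
            # Another method is at least as good on both axes and strictly better on one.
            (o_acc >= acc and o_cost <= cost) and (o_acc > acc or o_cost < cost)
            for _, o_acc, o_cost in methods
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (name, accuracy, cost) triples; placeholder values only.
methods = [("m1", 0.60, 10.0), ("m2", 0.55, 4.0), ("m3", 0.50, 8.0)]
front = pareto_front(methods)
print(front)
```

Here `m3` drops out because `m2` is both more accurate and cheaper, while `m1` and `m2` remain as a trade-off that accuracy alone would hide.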

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (10% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (10% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (40% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (60% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (0% vs 35% target).
Maintain strength on Papers with known annotation unit. Coverage is strong (40% vs 35% target).
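
The checklist labels can be reproduced from the coverage and target percentages. The page does not state the exact rule, so the sketch below assumes that coverage at or above target is "strong" and anything below is a "replication risk"; that assumption is consistent with all six rows above. The dictionary keys are shortened labels introduced here for illustration.

```python
def coverage_status(coverage_pct, target_pct):
    """Assumed rule: at or above target is strong, below target is a risk."""
    return "strong" if coverage_pct >= target_pct else "replication risk"


# (coverage %, target %) pairs copied from the checklist above.
checklist = {
    "explicit human feedback": (10, 45),
    "quality controls": (10, 30),
    "benchmarks/datasets named": (40, 35),
    "evaluation metrics named": (60, 35),
    "known rater population": (0, 35),
    "known annotation unit": (40, 35),
}

statuses = {
    item: coverage_status(coverage, target)
    for item, (coverage, target) in checklist.items()
}
for item, status in statuses.items():
    print(f"{item}: {status}")
```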

Known Limitations

Only 10% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: MATH - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=0, left_only=9 (automatic_metrics), right_only=1 (simulation_env)

No papers use both Automatic Metrics and Simulation Env.
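
The both/left_only/right_only counts are plain set arithmetic over the papers carrying each evaluation-mode tag, with automatic_metrics on the left and simulation_env on the right. A minimal sketch follows, using hypothetical paper ids (`p1` to `p10`) in place of the real entries; the function name is also an assumption.

```python
def overlap_counts(left, right):
    """Count items in both sets, only in the left set, and only in the right set."""
    left, right = set(left), set(right)
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Hypothetical ids: nine papers tagged automatic_metrics, one tagged simulation_env.
automatic_metrics = {f"p{i}" for i in range(1, 10)}
simulation_env = {"p10"}

counts = overlap_counts(automatic_metrics, simulation_env)
print(counts)
```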

Benchmark Brief

MATH

Coverage: 2 papers (20%)

2 papers (20%) mention MATH.

Examples: Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space; MathScape: Benchmarking Multimodal Large Language Models in Real-World Mathematical Contexts

Benchmark Brief

Amo-Bench

Coverage: 1 paper (10%)

1 paper (10%) mentions Amo-Bench.

Examples: What If We Allocate Test-Time Compute Adaptively?

Benchmark Brief

BankMathBench

Coverage: 1 paper (10%)

1 paper (10%) mentions BankMathBench.

Examples: BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios

Metric Brief

accuracy

Coverage: 3 papers (30%)

3 papers (30%) mention accuracy.

Examples: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; GATES: Self-Distillation under Privileged Context with Consensus Gating; BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios

Metric Brief

cost

Coverage: 3 papers (30%)

3 papers (30%) mention cost.

Examples: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer; Watermarking LLM Agent Trajectories

Metric Brief

agreement

Coverage: 1 paper (10%)

1 paper (10%) mentions agreement.

Examples: GATES: Self-Distillation under Privileged Context with Consensus Gating

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning; GATES: Self-Distillation under Privileged Context with Consensus Gating

Top Papers

Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0

Automatic Metrics · Long Horizon

This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.

ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee · Feb 24, 2026 · Citations: 0

Simulation Env · Long Horizon

We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.

GATES: Self-Distillation under Privileged Context with Consensus Gating
Alex Stein, Furong Huang, Tom Goldstein · Feb 24, 2026 · Citations: 0

Automatic Metrics · Long Horizon

Held-out in-domain accuracy under asymmetric evaluation improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.

Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026 · Citations: 0

Automatic Metrics · Long Horizon

Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.

Watermarking LLM Agent Trajectories
Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li · Feb 21, 2026 · Citations: 0

Automatic Metrics · Long Horizon

LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering.

BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
Yunseung Lee, Subin Kim, Youngjun Kwak, Jaegul Choo · Feb 19, 2026 · Citations: 0

Automatic Metrics · Long Horizon

However, such errors have rarely been captured by existing benchmarks.

Unlocking Reasoning Capability on Machine Translation in Large Language Models
Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio · Feb 16, 2026 · Citations: 0

Critique Edit · Automatic Metrics · Long Horizon

We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.

What If We Allocate Test-Time Compute Adaptively?
Ahsan Bilal, Ahmed Mohsin, Muhammad Umer, Ali Subhan, Hassan Rizwan · Feb 1, 2026 · Citations: 0

Automatic Metrics · Long Horizon

For each problem, the agent runs multiple inference iterations.

Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0

Automatic Metrics · Long Horizon

Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.

MathScape: Benchmarking Multimodal Large Language Models in Real-World Mathematical Contexts
Hao Liang, Linzhuang Sun, Minxuan Zhou, Zirong Chen, Meiyi Qiang · Aug 14, 2024 · Citations: 0

Automatic Metrics · Long Horizon

While existing benchmarks such as MathVista and MathVerse have advanced the evaluation of multimodal math proficiency, they primarily rely on digitally rendered content and fall short in capturing the complexity of real-world scenarios.
