HFEPX Hub

Math Papers (Last 120 Days)

Updated from current HFEPX corpus (Mar 1, 2026). 16 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: LiveCodeBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 11, 2026.

Papers: 16 · Last published: Feb 11, 2026

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage

100.0%

16 / 16 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

2 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.
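
The two filters described above can be sketched as simple predicates. This is a hypothetical illustration, not the hub's actual pipeline: the `Paper` record, its field names, and the lowercase tag spellings are assumptions, and the four sample rows are copied from the Protocol Matrix on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical record; field names are assumptions, not the hub's schema.
    title: str
    eval_modes: set = field(default_factory=set)
    benchmarks: set = field(default_factory=set)
    metrics: set = field(default_factory=set)


def is_replication_ready(p: Paper) -> bool:
    # Benchmark + metric + eval mode must all be explicitly present.
    return bool(p.benchmarks and p.metrics and p.eval_modes)


def supports_judge_vs_human(p: Paper) -> bool:
    # The paper must contain both human_eval and llm_as_judge.
    return {"human_eval", "llm_as_judge"} <= p.eval_modes


# Sample rows taken from the Protocol Matrix (empty set = "Not Reported").
papers = [
    Paper("Step 3.5 Flash", set(), {"LiveCodeBench", "BrowseComp"}, {"latency", "cost"}),
    Paper("Duel-Evolve", {"automatic_metrics"}, {"LiveCodeBench", "Mathbench"}, {"accuracy"}),
    Paper("Think^2", {"human_eval"}, {"GSM8K", "AIME"}, set()),
    Paper("Gradient Regularization", {"llm_as_judge", "automatic_metrics"}, set(), {"accuracy", "win_rate"}),
]

ready = [p.title for p in papers if is_replication_ready(p)]
comparable = [p.title for p in papers if supports_judge_vs_human(p)]
print(ready)
print(comparable)
```

On these four rows only Duel-Evolve passes the replication-ready check, and none passes the judge-vs-human check because no single paper lists both modes.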

Why This Matters For Eval Research

50% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 56.3% of papers in this hub.
LiveCodeBench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Raters are mostly domain experts, and annotation is commonly trajectory-level; use this to scope replication staffing.
No paper in this hub currently reports both human_eval and llm_as_judge; add papers that report both before trying to quantify judge-human agreement drift.

Benchmark Interpretation

LiveCodeBench appears in 12.5% of hub papers (2/16); use this cohort for benchmark-matched comparisons.
AIME appears in 6.3% of hub papers (1/16); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 37.5% of hub papers (6/16); compare with a secondary metric before ranking methods.
Cost is reported in 18.8% of hub papers (3/16); compare with a secondary metric before ranking methods.
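
The percentages in these sections appear to be a paper count divided by the 16 hub papers, rounded half-up to one decimal place (1/16 = 6.25% is shown as 6.3%). That rounding rule is inferred from the displayed values, not documented by the hub. A minimal sketch under that assumption, with `share` as a hypothetical helper name:

```python
from decimal import Decimal, ROUND_HALF_UP


def share(count: int, total: int) -> str:
    """Format count/total as a percentage with one decimal, rounding halves up."""
    pct = Decimal(count) * 100 / Decimal(total)
    return f"{pct.quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)}%"


# Counts taken from the Benchmark and Metric Interpretation sections.
for label, count in [("LiveCodeBench", 2), ("AIME", 1), ("accuracy", 6), ("cost", 3)]:
    print(label, share(count, 16))
```

`Decimal` is used here because binary floats with the default round-half-to-even behavior may not reproduce the half-up values shown on the page.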

Researcher Checklist

Strong: Papers with explicit human feedback
Coverage is strong (50% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (0% vs 30% target).

Strong: Papers naming benchmarks/datasets
Coverage is strong (37.5% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (56.3% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (12.5% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (18.8% vs 35% target).
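
The Strong/Gap labels in this checklist are consistent with a simple threshold comparison of coverage against target. The sketch below assumes the rule is "coverage at or above target is Strong, otherwise Gap"; the hub does not state its rule, and the `classify` function and dictionary keys are hypothetical names.

```python
def classify(coverage_pct: float, target_pct: float) -> str:
    # Assumed rule: meeting or exceeding the target counts as "Strong".
    return "Strong" if coverage_pct >= target_pct else "Gap"


# (coverage %, target %) pairs copied from the checklist above.
checklist = {
    "explicit human feedback": (50.0, 45.0),
    "quality controls": (0.0, 30.0),
    "benchmarks/datasets": (37.5, 35.0),
    "evaluation metrics": (56.3, 35.0),
    "rater population": (12.5, 35.0),
    "annotation unit": (18.8, 35.0),
}

labels = {name: classify(cov, target) for name, (cov, target) in checklist.items()}
for name, label in labels.items():
    print(f"{label}: {name}")
```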

Strengths

Strong human-feedback signal (50% of papers).
Most papers provide measurable evaluation context (37.5% benchmarks, 56.3% metrics).
Contains both human-eval and LLM-as-judge protocols, though in separate papers, so methodology comparisons are cross-paper only.

Known Gaps

No papers report quality controls (0%); prioritize calibration/adjudication evidence.
Rater population is under-specified (12.5% coverage).
Annotation unit is under-specified (18.8% coverage).

Suggested Next Analyses

Add papers that report both human_eval and llm_as_judge (none in this hub do yet), then compare them to quantify judge-human agreement drift.
Stratify by benchmark (LiveCodeBench vs AIME) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
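
Stratifying by benchmark before comparing methods can be sketched as a grouping step. The rows below are copied from the Protocol Matrix on this page; the `stratify` helper and the shortened titles are illustrative assumptions, not part of the hub.

```python
from collections import defaultdict

# (short title, benchmarks) pairs taken from the Protocol Matrix.
rows = [
    ("Step 3.5 Flash", ["LiveCodeBench", "BrowseComp"]),
    ("Duel-Evolve", ["LiveCodeBench", "Mathbench"]),
    ("Think^2", ["GSM8K", "AIME"]),
    ("BankMathBench", ["BankMathBench"]),
]


def stratify(rows):
    """Group paper titles by each benchmark they report."""
    strata = defaultdict(list)
    for title, benchmarks in rows:
        for benchmark in benchmarks:
            strata[benchmark].append(title)
    return dict(strata)


strata = stratify(rows)
for benchmark in ("LiveCodeBench", "AIME"):
    print(benchmark, strata[benchmark])
```

Only papers that land in the same stratum share a benchmark, so method comparisons are best kept within one stratum at a time.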

Recommended Queries

Judge vs Human Agreement
Benchmark Slice: LiveCodeBench
Metric Slice: accuracy
Recent High-Signal Papers

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Feb 11, 2026 · Citations: 0 · Score: 8.0
HF: Pairwise Preference · Eval: Not reported · Benchmark: LiveCodeBench · Metric: Latency

Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Feb 25, 2026 · Citations: 0 · Score: 8.0
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: LiveCodeBench · Metric: Accuracy

Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Feb 21, 2026 · Citations: 0 · Score: 6.5
HF: Pairwise Preference · Eval: Human Eval · Benchmark: GSM8K · Metric: Not reported

BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
Feb 19, 2026 · Citations: 0 · Score: 6.0
HF: Not reported · Eval: Automatic Metrics · Benchmark: BankMathBench · Metric: Accuracy

Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Feb 20, 2026 · Citations: 0 · Score: 6.0
HF: Not reported · Eval: LLM-as-Judge, Automatic Metrics · Benchmark: Not reported · Metric: Accuracy

Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale
Nov 7, 2025 · Citations: 0 · Score: 5.5
HF: Pairwise Preference · Eval: Not reported · Benchmark: MMLU · Metric: Not reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters (Feb 11, 2026)	Yes (Pairwise Preference)	Not Reported	LiveCodeBench, BrowseComp	Latency, Cost	Not Reported
Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences (Feb 25, 2026)	Yes (Pairwise Preference)	Automatic Metrics	LiveCodeBench, Mathbench	Accuracy	Not Reported
Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models (Feb 21, 2026)	Yes (Pairwise Preference)	Human Eval	GSM8K, AIME	Not Reported	Not Reported
BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios (Feb 19, 2026)	No	Automatic Metrics	BankMathBench	Accuracy	Not Reported
Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards (Feb 20, 2026)	Yes (type not reported)	LLM-as-Judge, Automatic Metrics	Not Reported	Accuracy, Win rate	Not Reported
Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale (Nov 7, 2025)	Yes (Pairwise Preference)	Not Reported	MMLU, MMLU Pro	Not Reported	Not Reported
Unlocking Reasoning Capability on Machine Translation in Large Language Models (Feb 16, 2026)	Yes (Critique Edit)	Not Reported	Not Reported	Not Reported	Not Reported
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation (Feb 12, 2026)	Yes (Expert Verification)	Not Reported	Not Reported	Not Reported	Not Reported
The logic of KM belief update is contained in the logic of AGM belief revision (Feb 26, 2026)	Yes (Critique Edit)	Not Reported	Not Reported	Not Reported	Not Reported
Cold-Start Personalization via Training-Free Priors from Structured World Models (Feb 16, 2026)	Yes (Pairwise Preference)	Not Reported	Not Reported	Not Reported	Not Reported
Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching (Feb 26, 2026)	No	Automatic Metrics	Not Reported	Accuracy, Latency	Not Reported
GATES: Self-Distillation under Privileged Context with Consensus Gating (Feb 24, 2026)	No	Automatic Metrics	Not Reported	Accuracy	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	Step 3.5 Flash: Open Frontier-Level Intelligence wi…	Duel-Evolve: Reward-Free Test-Time Scaling via LLM…	Think$^{2}$: Grounded Metacognitive Reasoning in La…
Human Feedback	Pairwise Preference	Pairwise Preference	Pairwise Preference
Evaluation Modes	Not reported	Automatic Metrics	Human Eval
Benchmarks	LiveCodeBench, BrowseComp	LiveCodeBench, Mathbench	GSM8K, AIME
Metrics	Latency, Cost	Accuracy	Not reported
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Domain Experts	Unknown	Unknown
Annotation Unit	Unknown	Pairwise	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (5)
Critique Edit (2)
Expert Verification (1)

Evaluation Modes

Automatic Metrics (9)
Human Eval (1)
LLM-as-Judge (1)

Top Benchmarks

LiveCodeBench (2)
AIME (1)
BankMathBench (1)
BrowseComp (1)

Top Metrics

Accuracy (6)
Cost (3)
Latency (2)
Agreement (1)

Rater Population Mix

Domain Experts (2)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 56.3% · benchmarks 37.5% · metrics 56.3% · quality controls 0.0%.

Top Papers

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026 · Citations: 0

Pairwise Preference Tool Use

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026 · Citations: 0

Pairwise Preference Human Eval

We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight…
Unlocking Reasoning Capability on Machine Translation in Large Language Models
Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio · Feb 16, 2026 · Citations: 0

Critique Edit Long Horizon

We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.
BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
Yunseung Lee, Subin Kim, Youngjun Kwak, Jaegul Choo · Feb 19, 2026 · Citations: 0

Automatic Metrics Long Horizon

However, such errors have rarely been captured by existing benchmarks.
Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale
David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu · Nov 7, 2025 · Citations: 0

Pairwise Preference

We introduce a framework able to synthesize vision-centric problems spanning diverse levels of complexity, and the resulting dataset with over 1M high-quality problems including: reasoning traces, preference data, and instruction prompts…
Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026 · Citations: 0

LLM-as-Judge Automatic Metrics

Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
GATES: Self-Distillation under Privileged Context with Consensus Gating
Alex Stein, Furong Huang, Tom Goldstein · Feb 24, 2026 · Citations: 0

Automatic Metrics Long Horizon

Held-out in-domain accuracy under asymmetric evaluation improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang · Feb 12, 2026 · Citations: 0

Expert Verification

Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term…
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026 · Citations: 0

Automatic Metrics Multi Agent

We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining.
Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0

Automatic Metrics Long Horizon

Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee · Feb 24, 2026 · Citations: 0

Automatic Metrics Long Horizon

We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.
Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026 · Citations: 0

Automatic Metrics Long Horizon

Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.
The logic of KM belief update is contained in the logic of AGM belief revision
Giacomo Bonanno · Feb 26, 2026 · Citations: 0

Critique Edit

Denoting the latter by $\mathcal{L}_{AGM}$ and the former by $\mathcal{L}_{KM}$ we show that every axiom of $\mathcal{L}_{KM}$ is a theorem of $\mathcal{L}_{AGM}$.
Cold-Start Personalization via Training-Free Priors from Structured World Models
Avinandan Bose, Shuyue Stella Li, Faeze Brahman, Pang Wei Koh, Simon Shaolei Du · Feb 16, 2026 · Citations: 0

Pairwise Preference

Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available.
