HFEPX Hub

Automatic Metrics + Math (Last 120 Days)

Updated from current HFEPX corpus (Mar 8, 2026). 14 papers are grouped in this hub page.

Common evaluation modes: Automatic Metrics, LLM-as-Judge. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: BankMathBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 25, 2026.

Papers: 14 · Last published: Feb 25, 2026

Automatic Metrics · Math · Last 120d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage: 100.0%. 14 / 14 sampled papers are not low-signal flagged.

Replication-Ready Set: benchmark + metric + eval mode explicitly present.

Judge/Human Comparability: papers containing both `human_eval` and `llm_as_judge`.

3 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Why This Matters For Eval Research

14.3% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 100% of papers in this hub.
BankMathBench is the most frequently cited benchmark anchor for cross-paper comparisons on this page, although it appears in only 1 of 14 papers.

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Where reported, rater context is domain experts (1 paper), and annotation is commonly trajectory-level; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Benchmark Interpretation

BankMathBench appears in 7.1% of hub papers (1/14); use this cohort for benchmark-matched comparisons.
GSM8K appears in 7.1% of hub papers (1/14); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 64.3% of hub papers (9/14); compare with a secondary metric before ranking methods.
Cost is reported in 28.6% of hub papers (4/14); compare with a secondary metric before ranking methods.
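
The percentages in the benchmark and metric interpretation lines follow directly from the paper counts over the 14-paper hub. A minimal sketch of that arithmetic, assuming one-decimal rounding (the helper name `share` is illustrative and not part of any HFEPX tooling):

```python
def share(count: int, total: int) -> str:
    """Return count/total as a percentage string with one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{100 * count / total:.1f}%"


HUB_SIZE = 14  # papers grouped in this hub

# Counts transcribed from the benchmark and metric interpretation lines.
counts = {"accuracy": 9, "cost": 4, "BankMathBench": 1, "GSM8K": 1}

shares = {name: share(n, HUB_SIZE) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct} of hub papers ({counts[name]}/{HUB_SIZE})")
```

With only 14 papers, each paper moves a share by about 7 percentage points, so small differences between these figures should not be over-interpreted.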

Researcher Checklist

Gap: Papers with explicit human feedback. Coverage is a replication risk (14.3% vs 45% target).
Gap: Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Moderate: Papers naming benchmarks/datasets. Coverage is usable but incomplete (28.6% vs 35% target).
Strong: Papers naming evaluation metrics. Coverage is strong (92.9% vs 35% target).
Gap: Papers with known rater population. Coverage is a replication risk (7.1% vs 35% target).
Strong: Papers with known annotation unit. Coverage is strong (42.9% vs 35% target).
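
The checklist labels can be reproduced by comparing observed coverage with its target. The page does not state the cutoff between "Moderate" and "Gap", so the sketch below assumes a hypothetical rule: at or above target is Strong, at least half of target is Moderate, anything lower is a Gap. The `moderate_ratio` value of 0.5 is an assumption chosen only because it is consistent with the six rows listed here.

```python
def coverage_band(observed_pct: float, target_pct: float,
                  moderate_ratio: float = 0.5) -> str:
    """Classify coverage against a target.

    moderate_ratio is a hypothetical cutoff (not documented on the page):
    coverage below the target but at least moderate_ratio * target is
    treated as "Moderate".
    """
    if observed_pct >= target_pct:
        return "Strong"
    if observed_pct >= moderate_ratio * target_pct:
        return "Moderate"
    return "Gap"


# (checklist item, observed %, target %) transcribed from the checklist.
CHECKLIST = [
    ("explicit human feedback", 14.3, 45.0),
    ("quality controls", 0.0, 30.0),
    ("benchmarks/datasets named", 28.6, 35.0),
    ("evaluation metrics named", 92.9, 35.0),
    ("known rater population", 7.1, 35.0),
    ("known annotation unit", 42.9, 35.0),
]

for item, observed, target in CHECKLIST:
    band = coverage_band(observed, target)
    print(f"{band}: {item} ({observed}% vs {target}% target)")
```

Other cutoffs would also fit these six rows, so treat the bands as a triage aid rather than a documented scoring rule.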

Strengths

Agentic evaluation appears in 78.6% of papers.

Known Gaps

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7.1% coverage).
LLM-as-judge appears without enough inter-annotator agreement reporting.

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (BankMathBench vs GSM8K) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
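
Stratifying by benchmark can be sketched as grouping papers under each benchmark they name and comparing methods only within a group. The records below are transcribed from the benchmark column of the protocol matrix on this page, with shortened titles; the function name and the two-paper minimum for a comparable stratum are illustrative assumptions.

```python
from collections import defaultdict

# (short title, benchmarks named) from the protocol matrix; papers with
# "Not Reported" benchmarks are represented by an empty list.
papers = [
    ("Duel-Evolve", ["LiveCodeBench", "Mathbench"]),
    ("Surgical Post-Training", []),
    ("BankMathBench", ["BankMathBench"]),
    ("Cache What Lasts", ["MATH 500", "GSM8K"]),
    ("Orthogonalized Policy Optimization", ["MATH"]),
]


def stratify_by_benchmark(records):
    """Map each benchmark name to the list of papers that report it."""
    strata = defaultdict(list)
    for title, benchmarks in records:
        for benchmark in benchmarks:
            strata[benchmark].append(title)
    return dict(strata)


strata = stratify_by_benchmark(papers)

# Assumption: a stratum needs at least two papers to support a
# benchmark-matched comparison.
comparable = {b: titles for b, titles in strata.items() if len(titles) >= 2}

print(strata)
print("Benchmark-matched strata:", comparable)
```

On these rows every stratum holds a single paper, so `comparable` is empty; benchmark-matched comparisons would need papers from outside this sample.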

Recommended Queries

LLM-as-Judge Protocols · Benchmark Slice: BankMathBench · Metric Slice: accuracy · Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

Highest protocol score with explicit human/eval signal plus LiveCodeBench.

Strongest benchmark reference

Surgical Post-Training: Cutting Errors, Keeping Knowledge

Reported benchmark with accuracy gives a fast comparison anchor.

Strongest recent paper

BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenari…

Useful for current practice scanning; published Feb 19, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Feb 25, 2026 · Citations: 0 · Score: 8.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: LiveCodeBench · Metric: Accuracy
Surgical Post-Training: Cutting Errors, Keeping Knowledge
Mar 2, 2026 · Citations: 0 · Score: 6.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
Feb 19, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: BankMathBench · Metric: Accuracy
Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Feb 20, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: LLM-as-Judge, Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Dec 3, 2025 · Citations: 0 · Score: 5.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: MATH 500 · Metric: Cost
GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered
Mar 2, 2026 · Citations: 0 · Score: 4.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Cost

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences (Feb 25, 2026)	Yes (Pairwise Preference)	Automatic Metrics	LiveCodeBench, Mathbench	Accuracy	Not Reported
Surgical Post-Training: Cutting Errors, Keeping Knowledge (Mar 2, 2026)	Yes (Pairwise Preference)	Automatic Metrics	Not Reported	Accuracy	Not Reported
BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios (Feb 19, 2026)	No (Not Reported)	Automatic Metrics	BankMathBench	Accuracy	Not Reported
Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards (Feb 20, 2026)	Yes (Not Reported)	LLM-as-Judge, Automatic Metrics	Not Reported	Accuracy, Win rate	Not Reported
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs (Dec 3, 2025)	No (Not Reported)	Automatic Metrics	MATH 500, GSM8K	Cost	Not Reported
GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered (Mar 2, 2026)	No (Not Reported)	Automatic Metrics	Not Reported	Cost	Not Reported
Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance (Feb 27, 2026)	No (Not Reported)	Automatic Metrics	Not Reported	Accuracy	Not Reported
Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching (Feb 26, 2026)	No (Not Reported)	Automatic Metrics	Not Reported	Accuracy, Latency	Not Reported
GATES: Self-Distillation under Privileged Context with Consensus Gating (Feb 24, 2026)	No (Not Reported)	Automatic Metrics	Not Reported	Accuracy	Not Reported
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning (Feb 26, 2026)	No (Not Reported)	Automatic Metrics	Not Reported	Accuracy	Not Reported
Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space (Jan 18, 2026)	No (Not Reported)	Automatic Metrics	MATH	Not Reported	Not Reported
Replaying pre-training data improves fine-tuning (Mar 5, 2026)	No (Not Reported)	Automatic Metrics	Not Reported	Accuracy	Not Reported
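
The two triage filters described on this page, replication-ready (benchmark + metric + explicit evaluation mode) and judge/human comparability (both `human_eval` and `llm_as_judge`), can be applied to the matrix rows above mechanically. The following is a minimal sketch, not the hub's actual implementation: the `PaperProtocol` record, the function names, and the `automatic_metrics` identifier are assumptions for illustration, and "Not Reported" cells are encoded as empty tuples.

```python
from dataclasses import dataclass

AM = "automatic_metrics"  # assumed identifier, by analogy with llm_as_judge


@dataclass(frozen=True)
class PaperProtocol:
    title: str
    eval_modes: tuple
    benchmarks: tuple = ()
    metrics: tuple = ()


def is_replication_ready(p: PaperProtocol) -> bool:
    """Benchmark, metric, and evaluation mode are all explicitly present."""
    return bool(p.benchmarks) and bool(p.metrics) and bool(p.eval_modes)


def supports_judge_vs_human(p: PaperProtocol) -> bool:
    """Paper reports both human evaluation and an LLM judge."""
    return {"human_eval", "llm_as_judge"} <= set(p.eval_modes)


# Rows transcribed from the protocol matrix (shortened titles).
MATRIX = [
    PaperProtocol("Duel-Evolve", (AM,), ("LiveCodeBench", "Mathbench"), ("Accuracy",)),
    PaperProtocol("Surgical Post-Training", (AM,), (), ("Accuracy",)),
    PaperProtocol("BankMathBench", (AM,), ("BankMathBench",), ("Accuracy",)),
    PaperProtocol("Gradient Regularization", ("llm_as_judge", AM), (), ("Accuracy", "Win rate")),
    PaperProtocol("Cache What Lasts", (AM,), ("MATH 500", "GSM8K"), ("Cost",)),
    PaperProtocol("GenDB", (AM,), (), ("Cost",)),
    PaperProtocol("Recycling Failures", (AM,), (), ("Accuracy",)),
    PaperProtocol("Reward-Guided Stitching", (AM,), (), ("Accuracy", "Latency")),
    PaperProtocol("GATES", (AM,), (), ("Accuracy",)),
    PaperProtocol("AgentDropoutV2", (AM,), (), ("Accuracy",)),
    PaperProtocol("Orthogonalized Policy Optimization", (AM,), ("MATH",), ()),
    PaperProtocol("Replaying pre-training data", (AM,), (), ("Accuracy",)),
]

ready = [p.title for p in MATRIX if is_replication_ready(p)]
judge_human = [p.title for p in MATRIX if supports_judge_vs_human(p)]

print("Replication-ready:", ready)
print("Judge-vs-human comparable:", judge_human)
```

On these twelve rows the filters select three replication-ready papers and no judge-vs-human papers, which matches the counts reported in the quick triage section.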

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	Duel-Evolve: Reward-Free Test-Time Scaling via LLM…	Surgical Post-Training: Cutting Errors, Keeping Kno…	BankMathBench: A Benchmark for Numerical Reasoning…
Human Feedback	Pairwise Preference	Pairwise Preference	Not reported
Evaluation Modes	Automatic Metrics	Automatic Metrics	Automatic Metrics
Benchmarks	LiveCodeBench, Mathbench	Not reported	BankMathBench
Metrics	Accuracy	Accuracy	Accuracy
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Unknown
Annotation Unit	Pairwise	Ranking	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (2)

Evaluation Modes

Automatic Metrics (14)
LLM-as-Judge (1)

Top Benchmarks

BankMathBench (1)
GSM8K (1)
LiveCodeBench (1)
LongMemEval (1)

Top Metrics

Accuracy (9)
Cost (4)
Agreement (1)
Coherence (1)

Rater Population Mix

Domain Experts (1)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 21.4% · benchmarks 28.6% · metrics 92.9% · quality controls 0.0%.

Top Papers

Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
Surgical Post-Training: Cutting Errors, Keeping Knowledge
Wenye Lin, Kai Han · Mar 2, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover--and validate both theoretically and empirically--an overlooked yet critical mechanism: the implicit regularization inherent in Direct…
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying · Dec 3, 2025 · Citations: 0

Automatic Metrics Long Horizon

Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBenchV2 and SCBench), TRIM-KV consistently outperforms…
BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
Yunseung Lee, Subin Kim, Youngjun Kwak, Jaegul Choo · Feb 19, 2026 · Citations: 0

Automatic Metrics Long Horizon

However, such errors have rarely been captured by existing benchmarks.
Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026 · Citations: 0

LLM-as-Judge · Automatic Metrics

Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered
Jiale Lao, Immanuel Trummer · Mar 2, 2026 · Citations: 0

Automatic Metrics Multi Agent

As a proof of concept, we present GenDB, an LLM-powered agentic system that generates instance-optimized and customized query execution code tailored to specific data, workloads, and hardware resources.
Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance
Yanwei Ren, Haotian Zhang, Likang Xiao, Xikai Zhang, Jiaxing Huang · Feb 27, 2026 · Citations: 0

Automatic Metrics Long Horizon

To address these issues, we propose SCOPE (Step-wise Correction for On-Policy Exploration), a novel framework that utilizes Process Reward Models to pinpoint the first erroneous step in suboptimal rollouts and applies fine-grained,…
Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
GATES: Self-Distillation under Privileged Context with Consensus Gating
Alex Stein, Furong Huang, Tom Goldstein · Feb 24, 2026 · Citations: 0

Automatic Metrics Long Horizon

Held-out in-domain accuracy under asymmetric evaluation improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026 · Citations: 0

Automatic Metrics Multi Agent

We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining.
Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0

Automatic Metrics Long Horizon

Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
Replaying pre-training data improves fine-tuning
Suhas Kotha, Percy Liang · Mar 5, 2026 · Citations: 0

Automatic Metrics Web Browsing

We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by 4.5% and Basque question-answering accuracy by 2%.
ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee · Feb 24, 2026 · Citations: 0

Automatic Metrics Long Horizon

We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.
Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026 · Citations: 0

Automatic Metrics Long Horizon

Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.
