HFEPX Benchmark Hub

MMLU Or MATH-500 Or SWE-bench Benchmark Papers

Updated from current HFEPX corpus (Apr 27, 2026). 60 papers are grouped in this benchmark page.

Common evaluation modes: Automatic Metrics, LLM-as-Judge. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: MMLU. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Mar 28, 2026.

Papers: 60 · Last published: Mar 28, 2026

Researcher Quick Triage

Use this page for benchmark-matched method comparisons and eval protocol selection. Quality band: High.

High-Signal Coverage

100.0%

60 / 60 sampled papers are not low-signal flagged.

Replication-Ready Set

Papers with explicit benchmark + metric + eval mode fields.

Quality Controls

0.0%

0 papers report calibration/adjudication/IAA controls.

17 papers explicitly name benchmark datasets in the sampled set.
13 papers report at least one metric term in metadata extraction.
Start with the ranked shortlist below before reading all papers.

Primary action: Start with the top 2 benchmark-matched papers, then compare evaluation modes in the protocol matrix.

Why This Matters For Eval Research

10% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 23.3% of papers in this hub (14/60).
MMLU is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Notes (Expanded)

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Raters are mostly domain experts, and annotation is commonly done at the trajectory level; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
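One way to act on the judge-calibration takeaway above is to measure chance-corrected agreement between an LLM judge and human raters on the same items. The sketch below computes Cohen's kappa in plain Python. The function name `cohen_kappa` and the label lists are hypothetical placeholders for illustration; they are not data from any paper in this hub.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts (1 = correct, 0 = incorrect) on the same eight items.
judge_labels = [1, 1, 0, 1, 0, 0, 1, 1]
human_labels = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohen_kappa(judge_labels, human_labels)
print(f"judge vs. human kappa: {kappa:.3f}")
```

Kappa is only one possible agreement statistic; which one fits depends on the label scale and the number of raters.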

Benchmark Interpretation

MMLU appears in 51.7% of hub papers (31/60); use this cohort for benchmark-matched comparisons.
MATH-500 appears in 25% of hub papers (15/60); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 33.3% of hub papers (20/60); compare with a secondary metric before ranking methods.
Cost is reported in 28.3% of hub papers (17/60); compare with a secondary metric before ranking methods.
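The shares quoted in the benchmark and metric notes above follow directly from the paper counts over the 60 hub papers. A minimal sketch of that arithmetic, using only counts stated on this page (the names `HUB_SIZE`, `counts`, and `share` are illustrative):

```python
HUB_SIZE = 60  # total papers grouped in this hub

# Paper counts as stated on this page.
counts = {
    "MMLU": 31,
    "MATH-500": 15,
    "accuracy": 20,
    "cost": 17,
}


def share(count, total=HUB_SIZE):
    """Percentage of hub papers, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {name: share(n) for name, n in counts.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}% ({counts[name]}/{HUB_SIZE})")
```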

Start Here (Benchmark-Matched First 6)

Ranked by protocol completeness so you can quickly find papers suitable for comparison studies.

PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Mar 28, 2026 · Citations: 0 · Score: 8.0
Eval: LLM-as-Judge, Automatic Metrics · Metrics: Accuracy

$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
Mar 4, 2026 · Citations: 0 · Score: 8.0
Eval: Automatic Metrics · Metrics: Pass@1

How Reliable is Language Model Micro-Benchmarking?
Oct 9, 2025 · Citations: 0 · Score: 7.0
Eval: Automatic Metrics · Metrics: Accuracy

S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Apr 1, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Pass@1

Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
Mar 9, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Accuracy

Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis
Mar 23, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Compare protocol ingredients quickly before deep-reading full papers.

Paper	Date	Eval Modes	Human Feedback	Metrics	Quality Controls
PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering	Mar 28, 2026	LLM-as-Judge, Automatic Metrics	Expert Verification	Accuracy, Relevance	Not reported
$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners	Mar 4, 2026	Automatic Metrics	Pairwise Preference	Pass@1	Not reported
How Reliable is Language Model Micro-Benchmarking?	Oct 9, 2025	Automatic Metrics	Pairwise Preference	Accuracy, Cost	Not reported
S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models	Apr 1, 2026	Automatic Metrics	Not reported	Pass@1, Cost	Not reported
Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning	Mar 9, 2026	Automatic Metrics	Not reported	Accuracy, Cost	Not reported
Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis	Mar 23, 2026	Automatic Metrics	Not reported	Accuracy, Recall	Not reported
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents	Feb 25, 2026	Automatic Metrics	Not reported	Pass@1, Latency	Not reported
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models	Feb 25, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference	Feb 25, 2026	Automatic Metrics	Not reported	Accuracy, Cost	Not reported
Inducing Epistemological Humility in Large Language Models: A Targeted SFT Approach to Reducing Hallucination	Mar 18, 2026	Not reported	Pairwise Preference	Not reported	Not reported
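To compare protocol ingredients programmatically, the ten matrix rows can be held as plain records and filtered. The sketch below transcribes the rows above with shortened titles; the tuple layout, the use of `None` for "Not reported", and the variable names are illustrative choices, not a format the hub provides.

```python
# (short title, eval modes, human feedback, metrics); None means "Not reported".
AUTO = ["Automatic Metrics"]
matrix = [
    ("PubMed Reasoner", ["LLM-as-Judge", "Automatic Metrics"], "Expert Verification", ["Accuracy", "Relevance"]),
    ("V_1", AUTO, "Pairwise Preference", ["Pass@1"]),
    ("Micro-Benchmarking Reliability", AUTO, "Pairwise Preference", ["Accuracy", "Cost"]),
    ("S0 Tuning", AUTO, None, ["Pass@1", "Cost"]),
    ("Learning When to Sample", AUTO, None, ["Accuracy", "Cost"]),
    ("Cross-Context Verification", AUTO, None, ["Accuracy", "Recall"]),
    ("SWE-Protege", AUTO, None, ["Pass@1", "Latency"]),
    ("D-COT", AUTO, None, ["Accuracy"]),
    ("Confidence-Driven Model Selection", AUTO, None, ["Accuracy", "Cost"]),
    ("Inducing Epistemological Humility", [], "Pairwise Preference", []),
]

# Papers in the matrix that report any human-feedback signal.
with_feedback = [title for title, _, feedback, _ in matrix if feedback is not None]

# Papers that report both accuracy and cost, so the trade-off can be compared.
accuracy_and_cost = [
    title for title, _, _, metrics in matrix
    if "Accuracy" in metrics and "Cost" in metrics
]

print("human feedback reported:", with_feedback)
print("accuracy + cost reported:", accuracy_and_cost)
```

The Quality Controls column is omitted from the records because every row in the matrix reads "Not reported".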

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback. Coverage is a replication risk (10% vs 45% target).
Gap: Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Strong: Papers naming benchmarks/datasets. Coverage is strong (100% vs 35% target).
Strong: Papers naming evaluation metrics. Coverage is strong (63.3% vs 35% target).
Gap: Papers with known rater population. Coverage is a replication risk (3.3% vs 35% target).
Gap: Papers with known annotation unit. Coverage is a replication risk (13.3% vs 35% target).
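The Gap and Strong labels in the checklist above are consistent with a simple threshold rule: coverage at or above target is Strong, anything below is a Gap. That rule is an assumption inferred from the six items, not a documented HFEPX definition. A minimal sketch using the checklist's own figures:

```python
# (coverage %, target %) for each checklist item, as listed on this page.
checklist = {
    "explicit human feedback": (10.0, 45.0),
    "quality controls": (0.0, 30.0),
    "benchmarks/datasets named": (100.0, 35.0),
    "evaluation metrics named": (63.3, 35.0),
    "known rater population": (3.3, 35.0),
    "known annotation unit": (13.3, 35.0),
}


def classify(coverage, target):
    """Assumed rule: meeting or exceeding the target counts as Strong."""
    return "Strong" if coverage >= target else "Gap"


labels = {item: classify(cov, tgt) for item, (cov, tgt) in checklist.items()}

# Order the gaps by shortfall (coverage minus target), largest shortfall first.
gaps = sorted(
    (item for item, label in labels.items() if label == "Gap"),
    key=lambda item: checklist[item][0] - checklist[item][1],
)

for item in gaps:
    cov, tgt = checklist[item]
    print(f"Gap: {item} ({cov}% vs {tgt}% target)")
```

Sorting by shortfall puts explicit human feedback first, which has the widest distance from its target among the items listed.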

Strengths

Most papers provide measurable evaluation context (100% benchmarks, 63.3% metrics).

Known Gaps

No papers report quality controls (0%); prioritize calibration/adjudication evidence.
Rater population is under-specified (3.3% coverage).
Annotation unit is under-specified (13.3% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (MMLU vs MATH-500) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
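The last two suggestions above can be combined: group results by benchmark first, then within each benchmark keep only the methods that are not dominated on accuracy and cost together. The sketch below does this with a small Pareto filter. All method names and numbers are hypothetical placeholders, not results from papers in this hub.

```python
from collections import defaultdict

# Hypothetical records: (method, benchmark, accuracy, cost). Placeholder values only.
results = [
    ("method_a", "MMLU", 0.71, 1.0),
    ("method_b", "MMLU", 0.69, 0.4),
    ("method_c", "MMLU", 0.65, 0.9),
    ("method_a", "MATH-500", 0.52, 2.0),
    ("method_b", "MATH-500", 0.55, 2.5),
]


def pareto_front(rows):
    """Return methods not dominated on (higher accuracy, lower cost)."""
    front = []
    for method, acc, cost in rows:
        dominated = any(
            other_acc >= acc and other_cost <= cost
            and (other_acc > acc or other_cost < cost)
            for _, other_acc, other_cost in rows
        )
        if not dominated:
            front.append(method)
    return front


# Stratify by benchmark so methods are only compared on matched tasks.
by_benchmark = defaultdict(list)
for method, benchmark, acc, cost in results:
    by_benchmark[benchmark].append((method, acc, cost))

fronts = {bench: pareto_front(rows) for bench, rows in by_benchmark.items()}
for bench, methods in fronts.items():
    print(f"{bench}: {methods}")
```

Reporting a front rather than a single ranking keeps the accuracy and cost trade-off visible, which is the point of tracking both metrics.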

Recommended Queries

LLM-as-Judge Protocols
Benchmark Slice: MMLU
Metric Slice: accuracy
Recent High-Signal Papers

Known Limitations

No papers report quality controls (0%); prioritize calibration/adjudication evidence.
Rater population is under-specified (3.3% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (14)
LLM-as-Judge (2)

Human Feedback Mix

Pairwise Preference (4)
Expert Verification (1)
Rubric Rating (1)

Top Benchmarks

MMLU (31)
MATH-500 (15)
SWE-bench (15)
GSM8K (13)

Top Metrics

Accuracy (20)
Cost (17)
Pass@1 (5)
Recall (4)

Top Papers On This Benchmark

PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Yiqing Zhang, Xiaozhong Liu, Fabricio Murai · Mar 28, 2026 · Citations: 0

Expert Verification · LLM-as-Judge · Automatic Metrics

In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan · Mar 4, 2026 · Citations: 0

Pairwise Preference · Automatic Metrics

On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, V_1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being…
How Reliable is Language Model Micro-Benchmarking?
Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta · Oct 9, 2025 · Citations: 0

Pairwise Preference · Automatic Metrics

We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark.
KLong: Training LLM Agent for Extremely Long-horizon Tasks
Yue Liu, Yingwei Ma, Yibo Miao, Yanhao Li, Yuchong Xie · Feb 19, 2026 · Citations: 0

Rubric Rating

Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics.
S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Jack Young · Apr 1, 2026 · Citations: 0

Automatic Metrics

Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval.
Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
Juming Xiong, Kevin Guo, Congning Ni, Chao Yan, Katherine Brown · Mar 9, 2026 · Citations: 0

Automatic Metrics

Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead.
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0

Automatic Metrics

Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0

Automatic Metrics

In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration --…
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying · Dec 3, 2025 · Citations: 0

Automatic Metrics

Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBenchV2 and SCBench), TRIM-KV consistently outperforms…
Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
DeepPrune: Parallel Scaling without Inter-trace Redundancy
Shangqing Tu, Yaxuan Li, Yushi Bai, Lei Hou, Juanzi Li · Oct 9, 2025 · Citations: 0

LLM-as-Judge · Automatic Metrics

Our method features a specialized judge model trained with out-of-distribution data (AIME 2022, AIME 2023, and MATH 500) using oversampling techniques to accurately predict answer equivalence from partial reasoning traces, achieving 0.7072…
Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis
Tae-Eun Song · Mar 23, 2026 · Citations: 0

Automatic Metrics

LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods--paraphrase consistency, n-gram overlap, perplexity analysis--never directly…
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026 · Citations: 0

Automatic Metrics

Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%.
Cost-Effective Communication: An Auction-based Method for Language Agent Interaction
Yijia Fan, Jusheng Zhang, Kaitong Cai, Jing Yang, Chengpei Tang · Nov 17, 2025 · Citations: 0

Automatic Metrics

To address this, we introduce the Dynamic Auction-based Language Agent (DALA), a novel framework that treats communication bandwidth as a scarce and tradable resource.
Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang · Mar 16, 2025 · Citations: 0

Automatic Metrics

Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM.
Inducing Epistemological Humility in Large Language Models: A Targeted SFT Approach to Reducing Hallucination
Cem Uluoglakci, Tugba Taskaya Temizel · Mar 18, 2026 · Citations: 0

Pairwise Preference

We also release HypoTermQA-Enhanced, a benchmark for hallucination tendency strengthened through multiple validations.
Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale
David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu · Nov 7, 2025 · Citations: 0

Pairwise Preference

We introduce a framework able to synthesize vision-centric problems spanning diverse levels of complexity, and the resulting dataset with over 1M high-quality problems including: reasoning traces, preference data, and instruction prompts…
TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping
Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John D. Kelleher · Apr 22, 2026 · Citations: 0
COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
Noah Flynn · Apr 22, 2026 · Citations: 0
A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression
Jincheng Ren, Siwei Wu, Yizhi Li, Kang Zhu, Shu Xu · Apr 21, 2026 · Citations: 0
Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?
Sho Hoshino, Ukyo Honda, Peinan Zhang · Apr 21, 2026 · Citations: 0
RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models
Yusuf Çelebi, Yağız Asker, Özay Ezerceli, Mahmoud ElHussieni, Selva Taş · Apr 21, 2026 · Citations: 0
Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals
Jon-Paul Cacioli · Apr 20, 2026 · Citations: 0
MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression
Libo Sun, Peixiong He, Po-Wei Harn, Xiao Qin · Apr 20, 2026 · Citations: 0
Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity
Leon Engländer, Sophia Althammer, Ahmet Üstün, Matthias Gallé, Tom Sherborne · Apr 19, 2026 · Citations: 0
Jupiter-N Technical Report
George Drayson · Apr 19, 2026 · Citations: 0
AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
Max Henning Höth, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu · Apr 17, 2026 · Citations: 0
Modeling LLM Unlearning as an Asymmetric Two-Task Learning Problem
Zeguan Xiao, Siqing Li, Yong Wang, Xuetao Wei, Jian Yang · Apr 16, 2026 · Citations: 0
Peer-Predictive Self-Training for Language Model Reasoning
Shi Feng, Hanlin Zhang, Fan Nie, Sham Kakade, Yiling Chen · Apr 14, 2026 · Citations: 0
Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking
Solomon Messing · Apr 13, 2026 · Citations: 0
SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel · Apr 9, 2026 · Citations: 0
Dead Weights, Live Signals: Feedforward Graphs of Frozen Language Models
Marcus Armstrong, Navid Ayoobi, Arjun Mukherjee · Apr 9, 2026 · Citations: 0
Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence
Niklas Herbster, Martin Zborowski, Alberto Tosato, Gauthier Gidel, Tommaso Tosato · Apr 9, 2026 · Citations: 0
Sensitivity-Positional Co-Localization in GQA Transformers
Manoj Chandrashekar Rao · Apr 9, 2026 · Citations: 0
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Xiangyue Liu, Zijian Zhang, Miles Yang, Zhao Zhong, Liefeng Bo · Apr 9, 2026 · Citations: 0
Cross-Model Disagreement as a Label-Free Correctness Signal
Matt Gorbett, Suman Jana · Mar 26, 2026 · Citations: 0
Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients
Michael Hardy, Joshua Gilbert, Benjamin Domingue · Mar 26, 2026 · Citations: 0
SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling
Yiqi Zhang, Huiqiang Jiang, Xufang Luo, Zhihe Yang, Chengruidong Zhang · Mar 24, 2026 · Citations: 0
Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?
Richard J. Young · Mar 23, 2026 · Citations: 0
AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation
Liang Ding · Mar 22, 2026 · Citations: 0
FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair
Ruize Ma, Yilei Jiang, Shilin Zhang, Zheng Ma, Yi Feng · Mar 18, 2026 · Citations: 0
Are Large Language Models Truly Smarter Than Humans?
Eshwar Reddy M, Sourav Karmakar · Mar 17, 2026 · Citations: 0
daVinci-Env: Open SWE Environment Synthesis at Scale
Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang · Mar 13, 2026 · Citations: 0
TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva · Mar 13, 2026 · Citations: 0
NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation
Yuxin Yang, Haoran Zhang, Mingxuan Li, Jiachen Xu, Ruoxi Shen · Mar 12, 2026 · Citations: 0
In-Context Environments Induce Evaluation-Awareness in Language Models
Maheep Chaudhary · Mar 4, 2026 · Citations: 0
Tool Verification for Test-Time Reinforcement Learning
Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh · Mar 2, 2026 · Citations: 0
Qwen3-Coder-Next Technical Report
Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng · Feb 28, 2026 · Citations: 0
Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
Jie Cao, Tianwei Lin, Zhenxuan Fan, Bo Yuan, Ziyuan Zhao · Feb 28, 2026 · Citations: 0
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale
Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Alexander Golubev · Feb 27, 2026 · Citations: 0
Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents
Zhi Chen, Zhensu Sun, Yuling Shi, Chao Peng, Xiaodong Gu · Feb 8, 2026 · Citations: 0
LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems
Badri N. Patro, Vijay S. Agneeswaran · Jan 20, 2026 · Citations: 0
TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks
Vansh Kapoor, Aman Gupta, Hao Chen, Anurag Beniwal, Jing Huang · Jan 15, 2026 · Citations: 0
PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models
Haoyu Zheng, Yun Zhu, Yuqian Yuan, Bo Yuan, Wenqiao Zhang · Jan 7, 2026 · Citations: 0
Training Language Models to Use Prolog as a Tool
Niklas Mellgren, Peter Schneider-Kamp, Lukas Galke Poech · Dec 8, 2025 · Citations: 0
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
Chenyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang · Oct 10, 2025 · Citations: 0
Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning
Jungsuk Oh, Jay-Yoon Lee · Aug 25, 2025 · Citations: 0
Strategic Scaling of Test-Time Compute: A Bandit Learning Approach
Bowen Zuo, Yinglun Zhu · Jun 15, 2025 · Citations: 0
Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune · May 29, 2025 · Citations: 0
Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios
Zhi Chen, Wei Ma, Lingxiao Jiang · Mar 16, 2025 · Citations: 0