HFEPX Benchmark Hub

MMLU Or LMSYS Chatbot Arena Or HotpotQA Benchmark Papers

Updated from current HFEPX corpus (Apr 25, 2026). 57 papers are grouped in this benchmark page.

Read Full Context

Updated from current HFEPX corpus (Apr 25, 2026). 57 papers are grouped in this benchmark page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: LMSYS Chatbot Arena. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 13, 2026.

Papers: 57 Last published: Feb 13, 2026 Global RSS

Researcher Quick Triage

Use this page for benchmark-matched method comparisons and eval protocol selection. Quality band: High .

High-Signal Coverage

100.0%

57 / 57 sampled papers are not low-signal flagged.

Replication-Ready Set

Papers with explicit benchmark + metric + eval mode fields.

Quality Controls

1.8%

1 papers report calibration/adjudication/IAA controls.

24 papers explicitly name benchmark datasets in the sampled set.
12 papers report at least one metric term in metadata extraction.
Start with the ranked shortlist below before reading all papers.

Primary action: Start with the top 2 benchmark-matched papers, then compare evaluation modes in the protocol matrix.

Why This Matters (Expanded)

Why This Matters For Eval Research

66.7% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 21.1% of papers in this hub.
LMSYS Chatbot Arena is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Notes (Expanded)

Protocol Takeaways

Most common quality-control signal is rater calibration (1.8% of papers).
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Benchmark Interpretation

LMSYS Chatbot Arena appears in 41.7% of hub papers (10/57); use this cohort for benchmark-matched comparisons.
HotpotQA appears in 29.2% of hub papers (7/57); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 37.5% of hub papers (9/57); compare with a secondary metric before ranking methods.
cost is reported in 20.8% of hub papers (5/57); compare with a secondary metric before ranking methods.

Start Here (Benchmark-Matched First 6)

Ranked by protocol completeness so you can quickly find papers suitable for comparison studies.

SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Feb 13, 2026 · Citations: 0 · Score: 9.5

Eval: Automatic Metrics · Metrics: Error rate
PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Mar 28, 2026 · Citations: 0 · Score: 8.5

Eval: Llm As Judge, Automatic Metrics · Metrics: Accuracy
GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Oct 27, 2025 · Citations: 0 · Score: 7.5

Eval: Automatic Metrics · Metrics: Mse
How Reliable is Language Model Micro-Benchmarking?
Oct 9, 2025 · Citations: 0 · Score: 7.0

Eval: Automatic Metrics · Metrics: Accuracy
MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
Apr 6, 2026 · Citations: 0 · Score: 7.0

Eval: Automatic Metrics · Metrics: Accuracy
OSCAR: Orchestrated Self-verification and Cross-path Refinement
Apr 2, 2026 · Citations: 0 · Score: 7.0

Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Compare protocol ingredients quickly before deep-reading full papers.

Paper	Eval Modes	Human Feedback	Metrics	Quality Controls
SCOPE: Selective Conformal Optimized Pairwise LLM Judging Feb 13, 2026	Automatic Metrics	Pairwise Preference	Error rate	Calibration
PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering Mar 28, 2026	Llm As Judge, Automatic Metrics	Expert Verification	Accuracy, Relevance	Not reported
GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA Oct 27, 2025	Automatic Metrics	Pairwise Preference	Mse	Not reported
How Reliable is Language Model Micro-Benchmarking? Oct 9, 2025	Automatic Metrics	Pairwise Preference	Accuracy, Cost	Not reported
MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents Apr 6, 2026	Automatic Metrics	Not reported	Accuracy, Recall	Not reported
OSCAR: Orchestrated Self-verification and Cross-path Refinement Apr 2, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning Mar 9, 2026	Automatic Metrics	Not reported	Accuracy, Cost	Not reported
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models Feb 25, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference Feb 25, 2026	Automatic Metrics	Not reported	Accuracy, Cost	Not reported
Inducing Epistemological Humility in Large Language Models: A Targeted SFT Approach to Reducing Hallucination Mar 18, 2026	Not reported	Pairwise Preference	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Strong: Papers with explicit human feedback

Coverage is strong (66.7% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (4.2% vs 30% target).
Strong: Papers naming benchmarks/datasets

Coverage is strong (100% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (50% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (8.3% vs 35% target).
Moderate: Papers with known annotation unit

Coverage is usable but incomplete (33.3% vs 35% target).

Strengths

Strong human-feedback signal (66.7% of papers).
Most papers provide measurable evaluation context (100% benchmarks, 50% metrics).
Agentic evaluation appears in 33.3% of papers.

Known Gaps

Only 4.2% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.3% coverage).
LLM-as-judge appears without enough inter-annotator agreement reporting.

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (LMSYS Chatbot Arena vs HotpotQA) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: LMSYS Chatbot Arena Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 4.2% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.3% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (12)
Llm As Judge (2)
Simulation Env (2)

Human Feedback Mix

Pairwise Preference (13)
Demonstrations (2)
Expert Verification (1)

Top Benchmarks

LMSYS Chatbot Arena (10)
HotpotQA (7)
MMLU (7)
Arena Hard (5)

Top Metrics

Accuracy (9)
Cost (5)
F1 (2)
Latency (2)

Top Papers On This Benchmark

SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Sher Badshah, Ali Emami, Hassan Sajjad · Feb 13, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.
PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Yiqing Zhang, Xiaozhong Liu, Fabricio Murai · Mar 28, 2026 · Citations: 0

Expert Verification Llm As JudgeAutomatic Metrics

In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
How Reliable is Language Model Micro-Benchmarking?
Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta · Oct 9, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark.
WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics
Chenxu Liu, Yingjie Fu, Wei Yang, Ying Zhang, Tao Xie · Jan 5, 2026 · Citations: 0

Pairwise Preference Llm As Judge

However, building a benchmark for LLM-generated web apps remains challenging due to the need for real-world user requirements, generalizable evaluation metrics without relying on ground-truth implementations or test cases, and interpretable…
DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning
Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao · Oct 10, 2025 · Citations: 0

Demonstrations Simulation Env

Current approaches either rely on prompting to elicit the model's innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped.
Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0

Demonstrations Simulation Env

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Zhichao Wang · Oct 27, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

This paper proposes Group-relative Implicit Fine-Tuning (GIFT), a reinforcement learning framework for aligning large language models (LLMs) that unifies on-policy optimization with implicit preference learning.
MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
Shu Wang, Edwin Yu, Oscar Love, Tom Zhang, Tom Wong · Apr 6, 2026 · Citations: 0

Automatic Metrics

Large Language Model (LLM) agents require persistent memory to maintain personalization, factual continuity, and long-horizon reasoning, yet standard context-window and retrieval-augmented generation (RAG) pipelines degrade over…
OSCAR: Orchestrated Self-verification and Cross-path Refinement
Yash Shah, Abhijit Chakraborty, Naresh Kumar Devulapally, Vishnu Lokhande, Vivek Gupta · Apr 2, 2026 · Citations: 0

Automatic Metrics

We introduce a suite of trajectory-level assessments, including a cross-chain divergence-at-hallucination (CDH) metric, for principled comparison of localization methods.
Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
Juming Xiong, Kevin Guo, Congning Ni, Chao Yan, Katherine Brown · Mar 9, 2026 · Citations: 0

Automatic Metrics

Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead.
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0

Automatic Metrics

In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration --…
PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch
Shangjian Yin, Shining Liang, Wenbiao Ding, Yuli Qian, Zhouxing Shi · Oct 8, 2025 · Citations: 0

Pairwise Preference

Despite its small size, fine-tuning Llama-3-8B-Base on PiKa-SFT even outperforms the official Llama-3-8B-Instruct model trained on over 10M proprietary examples on widely used benchmarks such as AlpacaEval 2.0 and Arena-Hard.
Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty
Yao Xiao, Jung-jae Kim, Roy Ka-wei Lee, Lidong Bing · Oct 7, 2025 · Citations: 0

Pairwise Preference

Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs).
Evaluation of Large Language Models via Coupled Token Generation
Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco · Feb 3, 2025 · Citations: 0

Pairwise Preference

In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning.
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026 · Citations: 0

Automatic Metrics

Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20\% to 40\%.
RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025 · Citations: 0

Automatic Metrics

A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); Finally the head agent composes…
Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
Ziliang Wang, Kang An, Xuhui Zheng, Faqiang Qian, Weikun Zhang · Oct 1, 2025 · Citations: 0

Automatic Metrics

We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process.
CLAUSE: Agentic Neuro-Symbolic Knowledge Graph Reasoning via Dynamic Learnable Context Engineering
Yang Zhao, Chengxiao Dai, Wei Zhuo, Yue Xiu, Dusit Niyato · Sep 25, 2025 · Citations: 0

Automatic Metrics

We introduce CLAUSE, an agentic three-agent neuro-symbolic framework that treats context construction as a sequential decision process over knowledge graphs, deciding what to expand, which paths to follow or backtrack, what evidence to…
Inducing Epistemological Humility in Large Language Models: A Targeted SFT Approach to Reducing Hallucination
Cem Uluoglakci, Tugba Taskaya Temizel · Mar 18, 2026 · Citations: 0

Pairwise Preference

We also release HypoTermQA-Enhanced, a benchmark for hallucination tendency strengthened through multiple validations.
Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
Yixin Liu, Yue Yu, DiJia Su, Sid Wang, Xuewei Wang · Mar 12, 2026 · Citations: 0

Pairwise Preference

Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked.
Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale
David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu · Nov 7, 2025 · Citations: 0

Pairwise Preference

We introduce a framework able to synthesize vision-centric problems spanning diverse levels of complexity, and the resulting dataset with over 1M high-quality problems including: reasoning traces, preference data, and instruction prompts…
Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
Junming Yang, Ning Xu, Biao Liu, Shiqi Qiao, Xin Geng · Sep 27, 2025 · Citations: 0

Pairwise Preference

To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training.
A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench
David Schlangen, Sherzod Hakimov, Chalamalasetti Kranti, Jonathan Jordan, Philipp Sadler · Jul 11, 2025 · Citations: 0

Pairwise Preference

There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation.
Search Arena: Analyzing Search-Augmented LLMs
Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan · Jun 5, 2025 · Citations: 0

Pairwise Preference

In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs.
MathDuels: Evaluating LLMs as Problem Posers and Solvers
Zhiqiu Xu, Shibo Jin, Shreya Arya, Mayur Naik · Apr 23, 2026 · Citations: 0
TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping
Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John D. Kelleher · Apr 22, 2026 · Citations: 0
COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
Noah Flynn · Apr 22, 2026 · Citations: 0
Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?
Sho Hoshino, Ukyo Honda, Peinan Zhang · Apr 21, 2026 · Citations: 0
RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models
Yusuf Çelebi, Yağız Asker, Özay Ezerceli, Mahmoud ElHussieni, Selva Taş · Apr 21, 2026 · Citations: 0
Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals
Jon-Paul Cacioli · Apr 20, 2026 · Citations: 0
Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems
Tianyi Huang, Samuel Xu, Jason Tansong Dang, Samuel Yan, Kimberley Yin · Apr 19, 2026 · Citations: 0
Jupiter-N Technical Report
George Drayson · Apr 19, 2026 · Citations: 0
AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
Max Henning Höth, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu · Apr 17, 2026 · Citations: 0
Modeling LLM Unlearning as an Asymmetric Two-Task Learning Problem
Zeguan Xiao, Siqing Li, Yong Wang, Xuetao Wei, Jian Yang · Apr 16, 2026 · Citations: 0
Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking
Solomon Messing · Apr 13, 2026 · Citations: 0
SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel · Apr 9, 2026 · Citations: 0
Dead Weights, Live Signals: Feedforward Graphs of Frozen Language Models
Marcus Armstrong, Navid Ayoobi, Arjun Mukherjee · Apr 9, 2026 · Citations: 0
Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence
Niklas Herbster, Martin Zborowski, Alberto Tosato, Gauthier Gidel, Tommaso Tosato · Apr 9, 2026 · Citations: 0
Sensitivity-Positional Co-Localization in GQA Transformers
Manoj Chandrashekar Rao · Apr 9, 2026 · Citations: 0
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Xiangyue Liu, Zijian Zhang, Miles Yang, Zhao Zhong, Liefeng Bo · Apr 9, 2026 · Citations: 0
Cross-Model Disagreement as a Label-Free Correctness Signal
Matt Gorbett, Suman Jana · Mar 26, 2026 · Citations: 0
Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients
Michael Hardy, Joshua Gilbert, Benjamin Domingue · Mar 26, 2026 · Citations: 0
Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?
Richard J. Young · Mar 23, 2026 · Citations: 0
Mediocrity is the key for LLM as a Judge Anchor Selection
Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen, Omri Abend · Mar 17, 2026 · Citations: 0
IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time
Zhenghua Bao, Yi Shi · Mar 17, 2026 · Citations: 0
Are Large Language Models Truly Smarter Than Humans?
Eshwar Reddy M, Sourav Karmakar · Mar 17, 2026 · Citations: 0
When LLM Judge Scores Look Good but Best-of-N Decisions Fail
Eddie Landesberg · Mar 12, 2026 · Citations: 0
NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation
Yuxin Yang, Haoran Zhang, Mingxuan Li, Jiachen Xu, Ruoxi Shen · Mar 12, 2026 · Citations: 0
In-Context Environments Induce Evaluation-Awareness in Language Models
Maheep Chaudhary · Mar 4, 2026 · Citations: 0
SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation
Jiahao Zhao, Feng Jiang, Shaowei Qin, Zhonghui Zhang, Junhao Liu · Feb 26, 2026 · Citations: 0
LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems
Badri N. Patro, Vijay S. Agneeswaran · Jan 20, 2026 · Citations: 0
LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding
Yichen Jiang, Jiakang Yuan, Chongjun Tu, Peng Ye, Tao Chen · Jan 17, 2026 · Citations: 0
SocraticKG: Knowledge Graph Construction via QA-Driven Fact Extraction
Sanghyeok Choi, Woosang Jeon, Kyuseok Yang, Taehyeong Kim · Jan 15, 2026 · Citations: 0
Training Language Models to Use Prolog as a Tool
Niklas Mellgren, Peter Schneider-Kamp, Lukas Galke Poech · Dec 8, 2025 · Citations: 0
Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning
Jungsuk Oh, Jay-Yoon Lee · Aug 25, 2025 · Citations: 0
FrugalRAG: Less is More in RL Finetuning for Multi-Hop Question Answering
Abhinav Java, Srivathsan Koundinyan, Nagarajan Natarajan, Amit Sharma · Jul 10, 2025 · Citations: 0
An Automated Survey of Generative Artificial Intelligence: Large Language Models, Architectures, Protocols, and Applications
Eduardo C. Garrido-Merchán, Álvaro López López · Jun 5, 2023 · Citations: 0