HFEPX Benchmark Hub

DROP Or AlpacaEval Or GPQA Benchmark Papers

Updated from current HFEPX corpus (Apr 17, 2026). 57 papers are grouped in this benchmark page.

Read Full Context

Updated from current HFEPX corpus (Apr 17, 2026). 57 papers are grouped in this benchmark page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Gold Questions. Frequently cited benchmark: AlpacaEval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Dec 26, 2025.

Papers: 57 Last published: Dec 26, 2025 Global RSS

Researcher Quick Triage

Use this page for benchmark-matched method comparisons and eval protocol selection. Quality band: High .

High-Signal Coverage

100.0%

57 / 57 sampled papers are not low-signal flagged.

Replication-Ready Set

Papers with explicit benchmark + metric + eval mode fields.

Quality Controls

1.8%

1 papers report calibration/adjudication/IAA controls.

20 papers explicitly name benchmark datasets in the sampled set.
12 papers report at least one metric term in metadata extraction.
Start with the ranked shortlist below before reading all papers.

Primary action: Start with the top 2 benchmark-matched papers, then compare evaluation modes in the protocol matrix.

Why This Matters (Expanded)

Why This Matters For Eval Research

70% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 21.1% of papers in this hub.
AlpacaEval is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Notes (Expanded)

Protocol Takeaways

Most common quality-control signal is gold-question checks (1.8% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Benchmark Interpretation

AlpacaEval appears in 40% of hub papers (8/57); use this cohort for benchmark-matched comparisons.
DROP appears in 40% of hub papers (8/57); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 45% of hub papers (9/57); compare with a secondary metric before ranking methods.
auroc is reported in 10% of hub papers (2/57); compare with a secondary metric before ranking methods.

Start Here (Benchmark-Matched First 6)

Ranked by protocol completeness so you can quickly find papers suitable for comparison studies.

CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Dec 26, 2025 · Citations: 0 · Score: 9.0

Eval: Automatic Metrics · Metrics: Accuracy
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
Mar 23, 2026 · Citations: 0 · Score: 8.5

Eval: Automatic Metrics · Metrics: Accuracy
Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment
Feb 14, 2026 · Citations: 0 · Score: 8.0

Eval: Automatic Metrics · Metrics: Elo
FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data
Mar 16, 2026 · Citations: 0 · Score: 8.0

Eval: Automatic Metrics · Metrics: Accuracy
GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Oct 27, 2025 · Citations: 0 · Score: 7.5

Eval: Automatic Metrics · Metrics: Mse
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Aug 28, 2025 · Citations: 0 · Score: 7.0

Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Compare protocol ingredients quickly before deep-reading full papers.

Paper	Eval Modes	Human Feedback	Metrics	Quality Controls
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics Dec 26, 2025	Automatic Metrics	Expert Verification	Accuracy	Gold Questions
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment Mar 23, 2026	Automatic Metrics	Pairwise Preference	Accuracy	Not reported
Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment Feb 14, 2026	Automatic Metrics	Pairwise Preference	Elo	Not reported
FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data Mar 16, 2026	Automatic Metrics	Expert Verification	Accuracy, Auroc	Not reported
GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA Oct 27, 2025	Automatic Metrics	Pairwise Preference	Mse	Not reported
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP Aug 28, 2025	Automatic Metrics	Red Team	Accuracy	Not reported
LLM-as-a-Judge for Time Series Explanations Apr 2, 2026	Llm As Judge, Automatic Metrics	Not reported	Accuracy, Faithfulness	Not reported
Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes Mar 15, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models Feb 25, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild? Feb 3, 2026	Automatic Metrics	Not reported	Accuracy	Not reported

Researcher Workflow (Detailed)

Checklist

Strong: Papers with explicit human feedback

Coverage is strong (70% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (5% vs 30% target).
Strong: Papers naming benchmarks/datasets

Coverage is strong (100% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (60% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (15% vs 35% target).
Moderate: Papers with known annotation unit

Coverage is usable but incomplete (30% vs 35% target).

Strengths

Strong human-feedback signal (70% of papers).
Most papers provide measurable evaluation context (100% benchmarks, 60% metrics).
Agentic evaluation appears in 25% of papers.

Known Gaps

Only 5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (15% coverage).
LLM-as-judge appears without enough inter-annotator agreement reporting.

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (AlpacaEval vs DROP) before comparing methods.
Track metric sensitivity by reporting both accuracy and auroc.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: AlpacaEval Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (15% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (12)
Llm As Judge (2)
Simulation Env (1)

Human Feedback Mix

Pairwise Preference (9)
Demonstrations (2)
Expert Verification (2)
Red Team (1)

Top Benchmarks

AlpacaEval (8)
DROP (8)
AlpacaEval 2.0 (4)
Arena Hard (4)

Top Metrics

Accuracy (9)
Auroc (2)
Cost (1)
Elo (1)

Top Papers On This Benchmark

CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0

Expert Verification Automatic Metrics

To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment
Jing Zhao, Ting Zhen, Junwei Bao, Hongfei Jiang, Yang Song · Feb 14, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability.
FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data
Mitul Goswami, Romit Chatterjee, Arif Ahmed Sekh · Mar 16, 2026 · Citations: 0

Expert Verification Automatic Metrics

Post-mitigation evaluation on seven clinically distinct cohorts derived from the MIMIC-IV-ED and eICU databases demonstrates substantial bias reduction: Statistical Parity Difference decreases by 40 to 51 percent on MIMIC-IV-ED and 10 to 19…
Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0

Demonstrations Simulation Env

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
James Wedgwood, Aashiq Muhamed, Mona T. Diab, Virginia Smith · Mar 23, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility.
GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Zhichao Wang · Oct 27, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

This paper proposes Group-relative Implicit Fine-Tuning (GIFT), a reinforcement learning framework for aligning large language models (LLMs) that unifies on-policy optimization with implicit preference learning.
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin · Aug 28, 2025 · Citations: 0

Red Team Automatic Metrics

These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
LLM-as-a-Judge for Time Series Explanations
Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar · Apr 2, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics

Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional…
Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes
Deepon Halder, Raj Dabre · Mar 15, 2026 · Citations: 0

Automatic Metrics

Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating…
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0

Automatic Metrics

In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration --…
AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song · Jun 17, 2025 · Citations: 0

Automatic Metrics

We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents.
PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch
Shangjian Yin, Shining Liang, Wenbiao Ding, Yuli Qian, Zhouxing Shi · Oct 8, 2025 · Citations: 0

Pairwise Preference

Despite its small size, fine-tuning Llama-3-8B-Base on PiKa-SFT even outperforms the official Llama-3-8B-Instruct model trained on over 10M proprietary examples on widely used benchmarks such as AlpacaEval 2.0 and Arena-Hard.
Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty
Yao Xiao, Jung-jae Kim, Roy Ka-wei Lee, Lidong Bing · Oct 7, 2025 · Citations: 0

Pairwise Preference

Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs).
DeepPrune: Parallel Scaling without Inter-trace Redundancy
Shangqing Tu, Yaxuan Li, Yushi Bai, Lei Hou, Juanzi Li · Oct 9, 2025 · Citations: 0

Llm As JudgeAutomatic Metrics

Our method features a specialized judge model trained with out-of-distribution data (AIME 2022, AIME 2023, and MATH 500) using oversampling techniques to accurately predict answer equivalence from partial reasoning traces, achieving 0.7072…
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar · Feb 3, 2026 · Citations: 0

Automatic Metrics

To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts.
TARo: Token-level Adaptive Routing for LLM Test-time Alignment
Arushi Rai, Qiang Zhang, Hanqing Zeng, Yunkai Zhang, Dipesh Tamboli · Mar 19, 2026 · Citations: 0

Pairwise Preference

Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning.
ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays
Aishik Sanyal · Feb 26, 2026 · Citations: 0

Pairwise Preference

Inspired by Humphrey's ipsundrum hypothesis, we implement ReCoN-Ipsundrum, an inspectable agent that extends a ReCoN state machine with a recurrent persistence loop over sensory salience Ns and an optional affect proxy reporting…
Schema for In-Context Learning
Pan Chen, Shaohong Chen, Mark Wang, Shi Xuan Leong, Priscilla Fung · Oct 14, 2025 · Citations: 0

Demonstrations

Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce Schema-Activated In-Context…
Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
Junming Yang, Ning Xu, Biao Liu, Shiqi Qiao, Xin Geng · Sep 27, 2025 · Citations: 0

Pairwise Preference

To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training.
Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms
Zeguan Xiao, Yun Chen, Guanhua Chen, Ke Tang · Jun 11, 2025 · Citations: 0

Pairwise Preference

Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning…
Context Over Content: Exposing Evaluation Faking in Automated Judges
Manan Gupta, Inderjeet Nair, Lu Wang, Dhruv Kumar · Apr 16, 2026 · Citations: 0
An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2
Ryan Lail · Apr 15, 2026 · Citations: 0
IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages
Aviral Dawar, Roshan Karanth, Vikram Goyal, Dhruv Kumar · Apr 15, 2026 · Citations: 0
C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences
Akira Kawabata, Saku Sugawara · Apr 15, 2026 · Citations: 0
One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness
Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram · Apr 14, 2026 · Citations: 0
Think in Sentences: Explicit Sentence Boundaries Enhance Language Model's Capabilities
Zhichen Liu, Yongyuan Li, Yang Xu · Apr 11, 2026 · Citations: 0
Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions
Utshab Kumar Ghosh, Ashish David, Shubham Chatterjee · Apr 11, 2026 · Citations: 0
Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence
Niklas Herbster, Martin Zborowski, Alberto Tosato, Gauthier Gidel, Tommaso Tosato · Apr 9, 2026 · Citations: 0
Sensitivity-Positional Co-Localization in GQA Transformers
Manoj Chandrashekar Rao · Apr 9, 2026 · Citations: 0
Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
Monishwaran Maheswaran, Leon Lakhani, Zhongzhu Zhou, Shijia Yang, Junxiong Wang · Apr 9, 2026 · Citations: 0
IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
David Gringras · Apr 9, 2026 · Citations: 0
Ego-Grounding for Personalized Question-Answering in Egocentric Videos
Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao · Apr 2, 2026 · Citations: 0
Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models
Chengyu Fang, Heng Guo, Zheng Jiang, Chunming He, Xiu Li · Mar 26, 2026 · Citations: 0
Mechanistically Interpreting Compression in Vision-Language Models
Veeraraju Elluru, Arth Singh, Roberto Aguero, Ajay Agarwal, Debojyoti Das · Mar 26, 2026 · Citations: 0
Off-Policy Value-Based Reinforcement Learning for Large Language Models
Peng-Yuan Wang, Ziniu Li, Tian Xu, Bohan Yang, Tian-Shuo Liu · Mar 24, 2026 · Citations: 0
RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue
Long Mai · Mar 24, 2026 · Citations: 0
Edge Radar Material Classification Under Geometry Shifts
Jannik Hohmann, Dong Wang, Andreas Nüchter · Mar 24, 2026 · Citations: 0
Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?
Richard J. Young · Mar 23, 2026 · Citations: 0
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna · Mar 18, 2026 · Citations: 0
Mediocrity is the key for LLM as a Judge Anchor Selection
Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen, Omri Abend · Mar 17, 2026 · Citations: 0
Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies
Giuseppe Samo, Paola Merlo · Mar 16, 2026 · Citations: 0
Attention Residuals
Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu · Mar 16, 2026 · Citations: 0
MXNorm: Reusing MXFP block scales for efficient tensor normalisation
Callum McLean, Luke Y. Prince, Alexandre Payot, Paul Balança, Carlo Luschi · Mar 13, 2026 · Citations: 0
TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva · Mar 13, 2026 · Citations: 0
LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories
Qianpu Sun, Xiaowei Chi, Yuhan Rui, Ying Li, Kuangzhi Ge · Mar 12, 2026 · Citations: 0
UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking
Chang Liu, Chuqiao Kuang, Tianyi Zhuang, Yuxin Cheng, Huichi Zhou · Mar 9, 2026 · Citations: 0
Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests
Amutheezan Sivagnanam, Ayan Mukhopadhyay, Samitha Samaranayake, Abhishek Dubey, Aron Laszka · Mar 8, 2026 · Citations: 0
Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning
Chuang Zhang, Zizhen Zhu, Yihao Wei, Bing Tian, Junyi Liu · Mar 4, 2026 · Citations: 0
CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li · Mar 1, 2026 · Citations: 0
Polynomial Mixing for Efficient Self-supervised Speech Encoders
Eva Feillet, Ryan Whetten, David Picard, Alexandre Allauzen · Feb 28, 2026 · Citations: 0
Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
Jie Cao, Tianwei Lin, Zhenxuan Fan, Bo Yuan, Ziyuan Zhao · Feb 28, 2026 · Citations: 0
Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
Qihua Dong, Kuo Yang, Lin Ju, Handong Zhao, Yitian Zhang · Feb 27, 2026 · Citations: 0
Agentic Explainable Artificial Intelligence (Agentic XAI) Approach To Explore Better Explanation
Tomoaki Yamaguchi, Yutong Zhou, Masahiro Ryo, Keisuke Katsura · Dec 24, 2025 · Citations: 0
Revisiting the Reliability of Language Models in Instruction-Following
Jianshuo Dong, Yutong Zhang, Yan Liu, Zhenyu Zhong, Tao Wei · Dec 15, 2025 · Citations: 0
Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen · Dec 8, 2025 · Citations: 0
Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space
Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, Beren Millidge · Oct 6, 2025 · Citations: 0
SafeSieve: From Heuristics to Experience in Progressive Pruning for LLM-based Multi-Agent Communication
Ruijia Zhang, Xinyan Zhao, Ruixiang Wang, Sigen Chen, Guibin Zhang · Aug 15, 2025 · Citations: 0