HFEPX Benchmark Hub

DROP Or GPQA Or HotpotQA Benchmark Papers

Updated from current HFEPX corpus (Apr 25, 2026). 69 papers are grouped in this benchmark page.


Common evaluation modes: Automatic Metrics, LLM As Judge. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Gold Questions. Frequently cited benchmark: DROP. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Dec 26, 2025.

Papers: 69 · Last published: Dec 26, 2025

Researcher Quick Triage

Use this page for benchmark-matched method comparisons and eval protocol selection. Quality band: High.

Analysis blocks are computed from the loaded sample (60 of 69 papers).

High-Signal Coverage

100.0%

60 / 60 sampled papers are not low-signal flagged.

Replication-Ready Set

Papers with explicit benchmark + metric + eval mode fields.

Quality Controls

1.7%

1 paper reports calibration/adjudication/IAA controls.

18 papers explicitly name benchmark datasets in the sampled set.
14 papers report at least one metric term in metadata extraction.
Start with the ranked shortlist below before reading all papers.

Primary action: Start with the top 2 benchmark-matched papers, then compare evaluation modes in the protocol matrix.

Why This Matters (Expanded)

Why This Matters For Eval Research

10.1% of papers report explicit human-feedback signals, led by demonstration data.
Automatic metrics appear in 20.3% of papers in this hub.
DROP is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Notes (Expanded)

Protocol Takeaways

The most common quality-control signal is gold-question checks (1.4% of papers).
Raters are mostly domain experts, and annotation is commonly at the trajectory level; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Benchmark Interpretation

DROP appears in 65.2% of hub papers (45/69); use this cohort for benchmark-matched comparisons.
GPQA appears in 18.8% of hub papers (13/69); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 39.1% of hub papers (27/69); compare with a secondary metric before ranking methods.
Cost is reported in 21.7% of hub papers (15/69); compare with a secondary metric before ranking methods.
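The percentages in the benchmark and metric interpretation blocks can be reproduced from the raw counts listed on this page. The sketch below assumes the hub size of 69 papers is the denominator throughout; the helper name `share` is illustrative and not part of HFEPX.

```python
# Counts copied from this page; the denominator is assumed to be the hub size.
HUB_SIZE = 69
benchmark_counts = {"DROP": 45, "GPQA": 13, "HotpotQA": 12, "AIME": 6}
metric_counts = {"accuracy": 27, "cost": 15, "latency": 5, "F1": 4}


def share(count: int, total: int = HUB_SIZE) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


benchmark_share = {name: share(n) for name, n in benchmark_counts.items()}
metric_share = {name: share(n) for name, n in metric_counts.items()}

print(benchmark_share)
print(metric_share)
```

The benchmark counts add up to 76, which is more than 69, so a paper can evidently be counted under several benchmarks and the shares need not sum to 100%.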

Start Here (Benchmark-Matched First 6)

Ranked by protocol completeness so you can quickly find papers suitable for comparison studies.

CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Dec 26, 2025 · Citations: 0 · Score: 9.0
Eval: Automatic Metrics · Metrics: Accuracy

FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data
Mar 16, 2026 · Citations: 0 · Score: 8.0
Eval: Automatic Metrics · Metrics: Accuracy

Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Aug 28, 2025 · Citations: 0 · Score: 7.0
Eval: Automatic Metrics · Metrics: Accuracy

LLM-as-a-Judge for Time Series Explanations
Apr 2, 2026 · Citations: 0 · Score: 7.0
Eval: LLM As Judge, Automatic Metrics · Metrics: Accuracy

MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
Apr 6, 2026 · Citations: 0 · Score: 7.0
Eval: Automatic Metrics · Metrics: Accuracy

OSCAR: Orchestrated Self-verification and Cross-path Refinement
Apr 2, 2026 · Citations: 0 · Score: 7.0
Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Compare protocol ingredients quickly before deep-reading full papers.

Paper	Published	Eval Modes	Human Feedback	Metrics	Quality Controls
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics	Dec 26, 2025	Automatic Metrics	Expert Verification	Accuracy	Gold Questions
FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data	Mar 16, 2026	Automatic Metrics	Expert Verification	Accuracy, AUROC	Not reported
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP	Aug 28, 2025	Automatic Metrics	Red Team	Accuracy	Not reported
LLM-as-a-Judge for Time Series Explanations	Apr 2, 2026	LLM As Judge, Automatic Metrics	Not reported	Accuracy, Faithfulness	Not reported
MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents	Apr 6, 2026	Automatic Metrics	Not reported	Accuracy, Recall	Not reported
OSCAR: Orchestrated Self-verification and Cross-path Refinement	Apr 2, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes	Mar 15, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models	Feb 25, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?	Feb 3, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays	Feb 26, 2026	Not reported	Pairwise Preference	Not reported	Not reported
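One quick way to compare protocol ingredients is to count how often each column of the matrix says "Not reported". The sketch below transcribes the ten rows by hand; the short paper names and the helper `missing_counts` are illustrative choices, not an HFEPX export format.

```python
# Rows transcribed from the protocol matrix on this page (titles shortened):
# (paper, eval modes, human feedback, metrics, quality controls)
MATRIX = [
    ("CricBench", "Automatic Metrics", "Expert Verification", "Accuracy", "Gold Questions"),
    ("FairMed-XGB", "Automatic Metrics", "Expert Verification", "Accuracy, AUROC", "Not reported"),
    ("Dyslexify", "Automatic Metrics", "Red Team", "Accuracy", "Not reported"),
    ("LLM-as-a-Judge for Time Series", "LLM As Judge, Automatic Metrics", "Not reported", "Accuracy, Faithfulness", "Not reported"),
    ("MemMachine", "Automatic Metrics", "Not reported", "Accuracy, Recall", "Not reported"),
    ("OSCAR", "Automatic Metrics", "Not reported", "Accuracy", "Not reported"),
    ("Top-b", "Automatic Metrics", "Not reported", "Accuracy", "Not reported"),
    ("D-COT", "Automatic Metrics", "Not reported", "Accuracy", "Not reported"),
    ("SpatiaLab", "Automatic Metrics", "Not reported", "Accuracy", "Not reported"),
    ("ReCoN-Ipsundrum", "Not reported", "Pairwise Preference", "Not reported", "Not reported"),
]
COLUMNS = ("Eval Modes", "Human Feedback", "Metrics", "Quality Controls")


def missing_counts(rows):
    """Count 'Not reported' cells per protocol column."""
    counts = {column: 0 for column in COLUMNS}
    for _paper, *fields in rows:
        for column, value in zip(COLUMNS, fields):
            if value == "Not reported":
                counts[column] += 1
    return counts


missing = missing_counts(MATRIX)
# Papers whose row has every protocol field filled in.
complete = [paper for paper, *fields in MATRIX if "Not reported" not in fields]

print(missing)
print(complete)
```

In this transcription, quality controls are the sparsest column and only one row has every field populated, which matches the page's advice to check calibration and adjudication evidence first.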

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback
Coverage is a replication risk (10.1% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (1.4% vs 30% target).

Strong: Papers naming benchmarks/datasets
Coverage is strong (100% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (66.7% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (2.9% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (8.7% vs 35% target).
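The Gap and Strong labels in this checklist are consistent with a simple threshold rule: a field is a gap when its coverage falls below its target. The sketch below encodes that rule as an assumption; the page does not state how HFEPX actually assigns the labels, and the names `Check` and `status` are hypothetical.

```python
from typing import NamedTuple


class Check(NamedTuple):
    field: str
    coverage: float  # percent of papers, as printed on this page
    target: float    # percent target, as printed on this page


# Values copied from the checklist above.
CHECKS = [
    Check("explicit human feedback", 10.1, 45.0),
    Check("quality controls", 1.4, 30.0),
    Check("benchmarks/datasets", 100.0, 35.0),
    Check("evaluation metrics", 66.7, 35.0),
    Check("known rater population", 2.9, 35.0),
    Check("known annotation unit", 8.7, 35.0),
]


def status(check: Check) -> str:
    """Assumed rule: 'Strong' when coverage meets the target, else 'Gap'."""
    return "Strong" if check.coverage >= check.target else "Gap"


labels = {check.field: status(check) for check in CHECKS}
gaps = [field for field, label in labels.items() if label == "Gap"]

for field, label in labels.items():
    print(f"{label}: {field}")
```

Under this rule the four gaps are human feedback, quality controls, rater population, and annotation unit, the same four the checklist flags.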

Strengths

Most papers provide measurable evaluation context (100% benchmarks, 66.7% metrics).

Known Gaps

Only 1.4% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (2.9% coverage).
Annotation unit is under-specified (8.7% coverage).

Suggested Next Analyses

Stratify by benchmark (DROP vs GPQA) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.
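For the inter-annotator agreement check, one common statistic for two raters is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a plain-Python illustration; the rater labels are made-up example data, and nothing here is taken from the papers in this hub.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label]
                   for label in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments on six model answers (illustrative data only).
rater_a = ["correct", "correct", "incorrect", "correct", "incorrect", "incorrect"]
rater_b = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

Here the raters agree on 4 of 6 items while chance agreement is 0.5, so kappa is about 0.333. Raw agreement alone (0.667) would overstate reliability, which is why a chance-corrected statistic is worth reporting when reproducing these protocols.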

Recommended Queries

LLM-as-Judge Protocols
Benchmark Slice: DROP
Metric Slice: accuracy
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (14)
LLM As Judge (2)
Simulation Env (2)

Human Feedback Mix

Demonstrations (3)
Expert Verification (2)
Pairwise Preference (1)
Red Team (1)

Top Benchmarks

DROP (45)
GPQA (13)
HotpotQA (12)
AIME (6)

Top Metrics

Accuracy (27)
Cost (15)
Latency (5)
F1 (4)

Top Papers On This Benchmark

CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0

Expert Verification, Automatic Metrics

To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data
Mitul Goswami, Romit Chatterjee, Arif Ahmed Sekh · Mar 16, 2026 · Citations: 0

Expert Verification, Automatic Metrics

Post-mitigation evaluation on seven clinically distinct cohorts derived from the MIMIC-IV-ED and eICU databases demonstrates substantial bias reduction: Statistical Parity Difference decreases by 40 to 51 percent on MIMIC-IV-ED and 10 to 19…
DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning
Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao · Oct 10, 2025 · Citations: 0

Demonstrations, Simulation Env

Current approaches either rely on prompting to elicit the model's innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped.
Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0

Demonstrations, Simulation Env

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin · Aug 28, 2025 · Citations: 0

Red Team, Automatic Metrics

These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
LLM-as-a-Judge for Time Series Explanations
Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar · Apr 2, 2026 · Citations: 0

LLM As Judge, Automatic Metrics

Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional…
MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
Shu Wang, Edwin Yu, Oscar Love, Tom Zhang, Tom Wong · Apr 6, 2026 · Citations: 0

Automatic Metrics

Large Language Model (LLM) agents require persistent memory to maintain personalization, factual continuity, and long-horizon reasoning, yet standard context-window and retrieval-augmented generation (RAG) pipelines degrade over…
OSCAR: Orchestrated Self-verification and Cross-path Refinement
Yash Shah, Abhijit Chakraborty, Naresh Kumar Devulapally, Vishnu Lokhande, Vivek Gupta · Apr 2, 2026 · Citations: 0

Automatic Metrics

We introduce a suite of trajectory-level assessments, including a cross-chain divergence-at-hallucination (CDH) metric, for principled comparison of localization methods.
Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes
Deepon Halder, Raj Dabre · Mar 15, 2026 · Citations: 0

Automatic Metrics

Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating…
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0

Automatic Metrics

In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration --…
AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song · Jun 17, 2025 · Citations: 0

Automatic Metrics

We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents.
DeepPrune: Parallel Scaling without Inter-trace Redundancy
Shangqing Tu, Yaxuan Li, Yushi Bai, Lei Hou, Juanzi Li · Oct 9, 2025 · Citations: 0

LLM As Judge, Automatic Metrics

Our method features a specialized judge model trained with out-of-distribution data (AIME 2022, AIME 2023, and MATH 500) using oversampling techniques to accurately predict answer equivalence from partial reasoning traces, achieving 0.7072…
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar · Feb 3, 2026 · Citations: 0

Automatic Metrics

To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts.
RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025 · Citations: 0

Automatic Metrics

A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); Finally the head agent composes…
Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
Ziliang Wang, Kang An, Xuhui Zheng, Faqiang Qian, Weikun Zhang · Oct 1, 2025 · Citations: 0

Automatic Metrics

We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process.
CLAUSE: Agentic Neuro-Symbolic Knowledge Graph Reasoning via Dynamic Learnable Context Engineering
Yang Zhao, Chengxiao Dai, Wei Zhuo, Yue Xiu, Dusit Niyato · Sep 25, 2025 · Citations: 0

Automatic Metrics

We introduce CLAUSE, an agentic three-agent neuro-symbolic framework that treats context construction as a sequential decision process over knowledge graphs, deciding what to expand, which paths to follow or backtrack, what evidence to…
ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays
Aishik Sanyal · Feb 26, 2026 · Citations: 0

Pairwise Preference

Inspired by Humphrey's ipsundrum hypothesis, we implement ReCoN-Ipsundrum, an inspectable agent that extends a ReCoN state machine with a recurrent persistence loop over sensory salience Ns and an optional affect proxy reporting…
Schema for In-Context Learning
Pan Chen, Shaohong Chen, Mark Wang, Shi Xuan Leong, Priscilla Fung · Oct 14, 2025 · Citations: 0

Demonstrations

Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce Schema-Activated In-Context…
Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems
Ye Yu, Heming Liu, Haibo Jin, Xiaopeng Yuan, Peng Kuang · Apr 23, 2026 · Citations: 0
Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
Hao-Yuan Chen · Apr 23, 2026 · Citations: 0
Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts
Azher Ahmed Efat, Seok Hwan Song, Wallapak Tavanapong · Apr 23, 2026 · Citations: 0
TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping
Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John D. Kelleher · Apr 22, 2026 · Citations: 0
MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation
Markus Knauer, Edoardo Fiorini, Maximilian Mühlbauer, Stefan Schneyer, Promwat Angsuratanawech · Apr 22, 2026 · Citations: 0
HaS: Accelerating RAG through Homology-Aware Speculative Retrieval
Peng Peng, Weiwei Lin, Wentai Wu, Xinyang Wang, Yongheng Liu · Apr 22, 2026 · Citations: 0
Detoxification for LLM: From Dataset Itself
Wei Shao, Yihang Wang, Gaoyu Zhu, Ziqiang Cheng, Lei Yu · Apr 21, 2026 · Citations: 0
Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems
Tianyi Huang, Samuel Xu, Jason Tansong Dang, Samuel Yan, Kimberley Yin · Apr 19, 2026 · Citations: 0
Cat-DPO: Category-Adaptive Safety Alignment
Tiankai Yang, Yi Nian, Xinyuan Li, Ruiyao Xu, Kaize Ding · Apr 19, 2026 · Citations: 0
Neurosymbolic Repo-level Code Localization
Xiufeng Xu, Xiufeng Wu, Zejun Zhang, Yi Li · Apr 17, 2026 · Citations: 0
Target-Oriented Pretraining Data Selection via Neuron-Activated Graph
Zijun Wang, Haoqin Tu, Weidong Zhou, Yiyang Zhou, Xiaohuan Zhou · Apr 17, 2026 · Citations: 0
Context Over Content: Exposing Evaluation Faking in Automated Judges
Manan Gupta, Inderjeet Nair, Lu Wang, Dhruv Kumar · Apr 16, 2026 · Citations: 0
An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2
Ryan Lail · Apr 15, 2026 · Citations: 0
IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages
Aviral Dawar, Roshan Karanth, Vikram Goyal, Dhruv Kumar · Apr 15, 2026 · Citations: 0
One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness
Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram · Apr 14, 2026 · Citations: 0
A Triadic Suffix Tokenization Scheme for Numerical Reasoning
Olga Chetverina · Apr 13, 2026 · Citations: 0
Think in Sentences: Explicit Sentence Boundaries Enhance Language Model's Capabilities
Zhichen Liu, Yongyuan Li, Yang Xu · Apr 11, 2026 · Citations: 0
Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions
Utshab Kumar Ghosh, Ashish David, Shubham Chatterjee · Apr 11, 2026 · Citations: 0
Sensitivity-Positional Co-Localization in GQA Transformers
Manoj Chandrashekar Rao · Apr 9, 2026 · Citations: 0
Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
Monishwaran Maheswaran, Leon Lakhani, Zhongzhu Zhou, Shijia Yang, Junxiong Wang · Apr 9, 2026 · Citations: 0
IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
David Gringras · Apr 9, 2026 · Citations: 0
Ego-Grounding for Personalized Question-Answering in Egocentric Videos
Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao · Apr 2, 2026 · Citations: 0
Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models
Chengyu Fang, Heng Guo, Zheng Jiang, Chunming He, Xiu Li · Mar 26, 2026 · Citations: 0
Mechanistically Interpreting Compression in Vision-Language Models
Veeraraju Elluru, Arth Singh, Roberto Aguero, Ajay Agarwal, Debojyoti Das · Mar 26, 2026 · Citations: 0
Off-Policy Value-Based Reinforcement Learning for Large Language Models
Peng-Yuan Wang, Ziniu Li, Tian Xu, Bohan Yang, Tian-Shuo Liu · Mar 24, 2026 · Citations: 0
RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue
Long Mai · Mar 24, 2026 · Citations: 0
Edge Radar Material Classification Under Geometry Shifts
Jannik Hohmann, Dong Wang, Andreas Nüchter · Mar 24, 2026 · Citations: 0
How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
Zixian Huang, Kaichen Yang, Xu Huang, Feiyang Hao, Qiming Ge · Mar 23, 2026 · Citations: 0
Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?
Richard J. Young · Mar 23, 2026 · Citations: 0
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna · Mar 18, 2026 · Citations: 0
IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time
Zhenghua Bao, Yi Shi · Mar 17, 2026 · Citations: 0
Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies
Giuseppe Samo, Paola Merlo · Mar 16, 2026 · Citations: 0
Attention Residuals
Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu · Mar 16, 2026 · Citations: 0
MXNorm: Reusing MXFP block scales for efficient tensor normalisation
Callum McLean, Luke Y. Prince, Alexandre Payot, Paul Balança, Carlo Luschi · Mar 13, 2026 · Citations: 0
TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva · Mar 13, 2026 · Citations: 0
LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories
Qianpu Sun, Xiaowei Chi, Yuhan Rui, Ying Li, Kuangzhi Ge · Mar 12, 2026 · Citations: 0
UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking
Chang Liu, Chuqiao Kuang, Tianyi Zhuang, Yuxin Cheng, Huichi Zhou · Mar 9, 2026 · Citations: 0
Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests
Amutheezan Sivagnanam, Ayan Mukhopadhyay, Samitha Samaranayake, Abhishek Dubey, Aron Laszka · Mar 8, 2026 · Citations: 0
Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning
Chuang Zhang, Zizhen Zhu, Yihao Wei, Bing Tian, Junyi Liu · Mar 4, 2026 · Citations: 0
CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li · Mar 1, 2026 · Citations: 0
Polynomial Mixing for Efficient Self-supervised Speech Encoders
Eva Feillet, Ryan Whetten, David Picard, Alexandre Allauzen · Feb 28, 2026 · Citations: 0
Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
Jie Cao, Tianwei Lin, Zhenxuan Fan, Bo Yuan, Ziyuan Zhao · Feb 28, 2026 · Citations: 0