HFEPX Benchmark Hub

DROP Benchmark Papers (Last 45 Days)

Updated from current HFEPX corpus (Mar 17, 2026). 11 papers are grouped in this benchmark page. Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Frequently cited benchmark: DROP. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 16, 2026.

Papers: 11 · Last published: Mar 16, 2026

Researcher Quick Triage

Use this page for benchmark-matched method comparisons and eval protocol selection. Quality band: Developing.

High-Signal Coverage

100.0%

11 / 11 sampled papers are not flagged as low-signal.

Replication-Ready Set

2

Papers with explicit benchmark + metric + eval mode fields.

Quality Controls

0.0%

0 papers report calibration, adjudication, or inter-annotator agreement (IAA) controls.

  • 3 papers explicitly name benchmark datasets in the sampled set.
  • 2 papers report at least one metric term in metadata extraction.
  • Start with the ranked shortlist below before reading all papers.

Primary action: use this page to map benchmark mentions first, and wait for stronger metric and quality-control coverage before making strict cross-paper comparisons.
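
One way to operationalize this triage is to filter exported paper metadata down to the replication-ready set (explicit benchmark, metric, and eval-mode fields). A minimal sketch follows; the record layout and field names are assumptions for illustration, not the actual HFEPX export schema, and the two sample records echo the Protocol Matrix and benchmark summary below.

```python
# Minimal triage sketch: filter hub papers down to a replication-ready shortlist.
# The record layout and field names are illustrative assumptions, not the HFEPX schema.

papers = [
    {
        "title": "FairMed-XGB",
        "benchmarks": ["DROP"],            # DROP appears in all hub papers per the summary
        "metrics": ["accuracy", "auroc"],
        "eval_modes": ["automatic metrics"],
    },
    {
        "title": "MXNorm",                 # protocol fields not reported in the matrix
        "benchmarks": [],
        "metrics": [],
        "eval_modes": [],
    },
]


def replication_ready(paper: dict) -> bool:
    """Replication-ready = explicit benchmark + metric + eval-mode fields all populated."""
    return all(paper.get(field) for field in ("benchmarks", "metrics", "eval_modes"))


shortlist = [p["title"] for p in papers if replication_ready(p)]
print(shortlist)  # ['FairMed-XGB']
```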

Why This Matters For Eval Research

  • 18.2% of papers report explicit human-feedback signals, led by expert verification.
  • The automatic-metrics eval mode appears in 18.2% of papers in this hub.
  • DROP is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Rater context is mostly domain experts, and annotation commonly uses mixed annotation units; use this to scope replication staffing.
  • Stratify by benchmark (DROP vs BrowseComp) before comparing methods; see the sketch below.
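
A minimal benchmark-stratification sketch, assuming a hypothetical results table with one row per (paper, benchmark) score; the column names and numbers are illustrative, not values drawn from the hub.

```python
# Stratify results by benchmark before ranking methods; never rank on pooled scores.
import pandas as pd

results = pd.DataFrame(
    {
        "paper": ["paper-a", "paper-a", "paper-b", "paper-c"],
        "benchmark": ["DROP", "BrowseComp", "DROP", "DROP"],
        "accuracy": [0.71, 0.38, 0.66, 0.74],
    }
)

# Rank methods only within each benchmark-matched cohort.
for benchmark, cohort in results.groupby("benchmark"):
    ranked = cohort.sort_values("accuracy", ascending=False)
    print(benchmark, ranked[["paper", "accuracy"]].to_dict("records"))
```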

Benchmark Interpretation

  • DROP appears in 100% of hub papers (11/11); use this cohort for benchmark-matched comparisons.
  • BrowseComp appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 45.5% of hub papers (5/11); compare with a secondary metric before ranking methods.
  • cost is reported in 18.2% of hub papers (2/11); compare with a secondary metric before ranking methods.

Start Here (Benchmark-Matched First 6)

Ranked by protocol completeness so you can quickly find papers suitable for comparison studies.

Protocol Matrix (Top 10)

Compare protocol ingredients quickly before deep-reading full papers.

  • FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data (Mar 16, 2026)
    Eval Modes: Automatic Metrics | Human Feedback: Expert Verification | Metrics: Accuracy, AUROC | Quality Controls: Not reported

  • SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild? (Feb 3, 2026)
    Eval Modes: Automatic Metrics | Human Feedback: Not reported | Metrics: Accuracy | Quality Controls: Not reported

  • ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays (Feb 26, 2026)
    Eval Modes: Not reported | Human Feedback: Pairwise Preference | Metrics: Not reported | Quality Controls: Not reported

  • MXNorm: Reusing MXFP block scales for efficient tensor normalisation (Mar 13, 2026)
    Eval Modes: Not reported | Human Feedback: Not reported | Metrics: Not reported | Quality Controls: Not reported

  • LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories (Mar 12, 2026)
    Eval Modes: Not reported | Human Feedback: Not reported | Metrics: Not reported | Quality Controls: Not reported

  • UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking (Mar 9, 2026)
    Eval Modes: Not reported | Human Feedback: Not reported | Metrics: Not reported | Quality Controls: Not reported

  • Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests (Mar 8, 2026)
    Eval Modes: Not reported | Human Feedback: Not reported | Metrics: Not reported | Quality Controls: Not reported

  • Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning (Mar 4, 2026)
    Eval Modes: Not reported | Human Feedback: Not reported | Metrics: Not reported | Quality Controls: Not reported

  • Polynomial Mixing for Efficient Self-supervised Speech Encoders (Feb 28, 2026)
    Eval Modes: Not reported | Human Feedback: Not reported | Metrics: Not reported | Quality Controls: Not reported

  • Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs (Feb 28, 2026)
    Eval Modes: Not reported | Human Feedback: Not reported | Metrics: Not reported | Quality Controls: Not reported

Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (18.2% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (0% vs 30% target).

  • Strong: Papers naming benchmarks/datasets

    Coverage is strong (100% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (63.6% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (9.1% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (0% vs 35% target).
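
The Gap/Strong labels above compare observed field coverage against the hub's target thresholds. A minimal sketch of that comparison, reusing the percentages from the checklist (the dictionary layout itself is an assumption):

```python
# Coverage vs. target check, using the percentages reported in the checklist above.
coverage_vs_target = {
    "explicit human feedback": (18.2, 45.0),
    "quality controls": (0.0, 30.0),
    "benchmarks/datasets named": (100.0, 35.0),
    "metrics named": (63.6, 35.0),
    "known rater population": (9.1, 35.0),
    "known annotation unit": (0.0, 35.0),
}

for field, (observed, target) in coverage_vs_target.items():
    status = "Strong" if observed >= target else "Gap"
    print(f"{status}: {field} ({observed:.1f}% vs {target:.0f}% target)")
```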

Strengths

  • Most papers provide measurable evaluation context (100% benchmarks, 63.6% metrics).

Known Gaps

  • No papers (0%) report quality controls; prioritize calibration or adjudication evidence.
  • Rater population is under-specified (9.1% coverage).
  • Annotation unit is under-specified (0% coverage).

Suggested Next Analyses

  • Stratify by benchmark (DROP vs BrowseComp) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and cost; see the sketch below.
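
One way to check metric sensitivity is to compare how the two metrics rank the same methods; if accuracy and cost disagree, a single-metric leaderboard is misleading. The scores below are made up for illustration (lower cost treated as better), not taken from any hub paper.

```python
# Metric-sensitivity sketch: do accuracy and cost produce the same method ranking?
import pandas as pd

scores = pd.DataFrame(
    {
        "method": ["method-1", "method-2", "method-3"],
        "accuracy": [0.74, 0.71, 0.66],
        "cost": [1.8, 0.9, 1.1],  # e.g. dollars per 1k queries; lower is better
    }
)

rank_by_accuracy = scores.sort_values("accuracy", ascending=False)["method"].tolist()
rank_by_cost = scores.sort_values("cost", ascending=True)["method"].tolist()

print("accuracy ranking:", rank_by_accuracy)
print("cost ranking:", rank_by_cost)
print("rankings agree:", rank_by_accuracy == rank_by_cost)
```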

Known Limitations

  • No papers (0%) report quality controls; prioritize calibration or adjudication evidence.
  • Rater population is under-specified (9.1% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (2)

Human Feedback Mix

  • Expert Verification (1)
  • Pairwise Preference (1)

Top Benchmarks

  • DROP (11)
  • BrowseComp (1)
  • GAIA (1)
  • MATH 500 (1)

Top Metrics

  • Accuracy (5)
  • Cost (2)
  • Precision (2)
  • AUROC (1)
