
HFEPX Metric Hub

Accuracy + Expert Verification Metric Papers


Updated from the current HFEPX corpus (Mar 1, 2026). Eight papers are grouped on this metric page. Common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequent quality control: Adjudication. Frequently cited benchmark: BIRD. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 15, 2026.
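
The comparison dimensions summarized above (metrics, benchmarks, eval modes, rater population, annotation unit, quality controls) map naturally onto a small per-paper record. The sketch below is a hypothetical shape for that record, not the hub's actual schema; field names and defaults are assumptions, and the example row is copied from the protocol matrix further down.

```python
from dataclasses import dataclass, field

@dataclass
class PaperProtocol:
    """Hypothetical record for one paper's evaluation-protocol metadata."""
    title: str
    published: str                                              # e.g. "Feb 15, 2026"
    metrics: list[str] = field(default_factory=list)            # e.g. ["Accuracy"]
    benchmarks: list[str] = field(default_factory=list)         # empty if not reported
    eval_modes: list[str] = field(default_factory=list)         # e.g. ["Automatic Metrics"]
    rater_population: str = "Not reported"                      # e.g. "Domain Experts"
    annotation_unit: str = "Not reported"                       # e.g. "Freeform"
    quality_controls: list[str] = field(default_factory=list)   # e.g. ["Adjudication"]

# Example row taken from the protocol matrix below.
hle_verified = PaperProtocol(
    title="HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam",
    published="Feb 15, 2026",
    metrics=["Accuracy"],
    benchmarks=["HLE"],
    eval_modes=["Automatic Metrics"],
    quality_controls=["Adjudication"],
)
```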

Papers: 8 · Last published: Feb 15, 2026

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: Developing.

Metric Coverage

100.0%

8 sampled papers include metric names.

Benchmark Anchoring

25.0%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

37.5%

3 papers report calibration/adjudication/IAA controls.

  • None of the 8 papers in this sample is flagged as low-signal.
  • Use the protocol matrix below to avoid comparing metrics across incompatible eval setups; a minimal compatibility check is sketched after this list.
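
A minimal sketch of the compatibility gate referenced in the bullet above, assuming per-paper metadata shaped like the protocol matrix columns; the helper name and dict layout are illustrative, not part of the hub.

```python
def comparable(a: dict, b: dict) -> bool:
    """Hypothetical gate: only compare scores when the eval setups overlap.

    `a` and `b` mirror the protocol matrix columns: lists under
    'benchmarks', 'eval_modes', and 'metrics'.
    """
    shared_benchmark = set(a["benchmarks"]) & set(b["benchmarks"])
    shared_mode = set(a["eval_modes"]) & set(b["eval_modes"])
    shared_metric = set(a["metrics"]) & set(b["metrics"])
    # Papers with no reported benchmark fail the gate by construction.
    return bool(shared_benchmark and shared_mode and shared_metric)


hle = {"benchmarks": ["HLE"], "eval_modes": ["Automatic Metrics"], "metrics": ["Accuracy"]}
cric = {"benchmarks": ["DROP", "BIRD"], "eval_modes": ["Automatic Metrics"], "metrics": ["Accuracy"]}
print(comparable(hle, cric))  # False: no shared benchmark, so accuracy values are not directly comparable
```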

Primary action: Treat this as directional signal only; metric reporting is present but benchmark anchoring is still thin.

Why This Matters For Eval Research

  • 100% of papers report explicit human-feedback signals, led by expert verification.
  • Automatic metrics appear in 100% of papers in this hub.
  • BIRD is a recurring benchmark anchor for cross-paper comparisons on this page.

Metric-Driven Protocol Takeaways

  • The most common quality-control signal is adjudication (12.5% of papers).
  • Raters are mostly domain experts, and the annotation unit is commonly Freeform; use this to scope replication staffing.
  • Stratify by benchmark (BIRD vs. CricBench) before comparing methods; a grouping sketch follows this list.
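
A minimal grouping sketch for the stratification step above, assuming each paper record carries a benchmark list as in the protocol matrix; titles are abbreviated and the code is illustrative only.

```python
from collections import defaultdict

# Illustrative subset of the protocol matrix; benchmark lists mirror that table.
papers = [
    {"title": "HLE-Verified", "benchmarks": ["HLE"]},
    {"title": "CricBench", "benchmarks": ["DROP", "BIRD"]},
    {"title": "A Scalable Framework for Evaluating Health Language Models", "benchmarks": []},
]

strata = defaultdict(list)
for paper in papers:
    # Papers without a reported benchmark form their own stratum rather than
    # being silently pooled with benchmark-anchored results.
    for bench in paper["benchmarks"] or ["Not reported"]:
        strata[bench].append(paper["title"])

for bench, titles in sorted(strata.items()):
    print(f"{bench}: {titles}")
# Compare methods only within a stratum (e.g. the BIRD cohort).
```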

Metric Interpretation

  • Accuracy is reported in 100% of hub papers (8/8); compare it with a secondary metric before ranking methods (a rank-comparison sketch follows this list).
  • Agreement is reported in 12.5% of hub papers (1/8); compare it with a secondary metric before ranking methods.
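
One way to act on the "secondary metric" advice above is to check whether rankings under the primary and secondary metric agree. The sketch below uses made-up scores and a simple rank comparison; none of the numbers come from this hub.

```python
def ranks(scores: dict) -> dict:
    """Rank items by score, highest first (0 = best)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered)}

# Hypothetical scores for three methods under a primary and a secondary metric.
accuracy = {"method_a": 0.81, "method_b": 0.78, "method_c": 0.74}
agreement = {"method_a": 0.62, "method_b": 0.71, "method_c": 0.58}

acc_rank, agr_rank = ranks(accuracy), ranks(agreement)
flips = [m for m in accuracy if acc_rank[m] != agr_rank[m]]
print("Rank disagreements:", flips)
# A non-empty list means an accuracy-only ranking would be fragile.
```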

Benchmark Context

  • BIRD appears in 12.5% of hub papers (1/8); use this cohort for benchmark-matched comparisons.
  • CricBench appears in 12.5% of hub papers (1/8); use this cohort for benchmark-matched comparisons.

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

Paper | Date | Metrics | Benchmarks | Eval Modes | Quality Controls
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam | Feb 15, 2026 | Accuracy | HLE | Automatic Metrics | Adjudication
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics | Dec 26, 2025 | Accuracy | DROP, BIRD | Automatic Metrics | Gold Questions
A Scalable Framework for Evaluating Health Language Models | Mar 30, 2025 | Accuracy, Agreement | Not reported | Automatic Metrics | Inter-Annotator Agreement Reported
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models | Feb 25, 2026 | Accuracy | Not reported | Automatic Metrics | Not reported
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video | Feb 25, 2026 | Accuracy | Not reported | Automatic Metrics | Not reported
What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform | Feb 19, 2026 | Accuracy | Not reported | Automatic Metrics | Not reported
MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation | Mar 23, 2025 | Accuracy | Not reported | Automatic Metrics | Not reported
Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare | Feb 22, 2025 | Accuracy | Not reported | Automatic Metrics | Not reported

Researcher Workflow (Detailed)

Checklist

  • Strong: Papers with explicit human feedback

    Coverage is strong (100% vs 45% target).

  • Strong: Papers reporting quality controls

    Coverage is strong (37.5% vs 30% target).

  • Moderate: Papers naming benchmarks/datasets

    Coverage is usable but incomplete (25% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (100% vs 35% target).

  • Strong: Papers with known rater population

    Coverage is strong (100% vs 35% target).

  • Moderate: Papers with known annotation unit

    Coverage is usable but incomplete (25% vs 35% target). The sketch after this checklist reproduces these coverage-vs-target checks.
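
The bands in the checklist above follow from coverage counts over the 8-paper sample. The sketch below reproduces that arithmetic; the hub's exact banding rule is not documented, so treating "at or above target" as strong and "below target" as moderate is an assumption.

```python
SAMPLE_SIZE = 8

# (check, papers covered, target share) -- counts taken from the checklist above.
checks = [
    ("Explicit human feedback",   8, 0.45),
    ("Quality controls reported", 3, 0.30),
    ("Benchmarks/datasets named", 2, 0.35),
    ("Evaluation metrics named",  8, 0.35),
    ("Known rater population",    8, 0.35),
    ("Known annotation unit",     2, 0.35),
]

for name, count, target in checks:
    coverage = count / SAMPLE_SIZE
    band = "Strong" if coverage >= target else "Moderate"  # assumed banding rule
    print(f"{band:8} {name}: {coverage:.1%} vs {target:.0%} target")
```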

Strengths

  • Strong human-feedback signal (100% of papers).
  • Quality-control evidence appears in 37.5% of papers.

Known Gaps

  • No dominant metadata gap detected in current extraction coverage.

Suggested Next Analyses

  • Stratify by benchmark (BIRD vs. CricBench) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and agreement.

Known Limitations

  • No dominant metadata gap detected in current extraction coverage.
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
  • Cross-page comparisons should be benchmark- and metric-matched to avoid protocol confounding; see the matching sketch after this list.
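
For the last limitation, a minimal sketch of matching on (benchmark, metric) keys before comparing results across hub pages; the record shape, page contents, and values are all assumptions.

```python
# Hypothetical result records from two hub pages; papers and values are made up.
page_a = [
    {"paper": "paper_x", "benchmark": "BIRD", "metric": "Accuracy", "value": 0.58},
]
page_b = [
    {"paper": "paper_y", "benchmark": "BIRD", "metric": "Accuracy", "value": 0.61},
    {"paper": "paper_z", "benchmark": "MMLU", "metric": "Accuracy", "value": 0.70},
]

def by_protocol(rows):
    """Index rows by (benchmark, metric) so only protocol-matched pairs meet."""
    index = {}
    for row in rows:
        index.setdefault((row["benchmark"], row["metric"]), []).append(row)
    return index

a, b = by_protocol(page_a), by_protocol(page_b)
for key in sorted(a.keys() & b.keys()):
    for ra in a[key]:
        for rb in b[key]:
            delta = round(rb["value"] - ra["value"], 3)
            print(key, ra["paper"], "vs", rb["paper"], "delta:", delta)
# Unmatched keys (MMLU here) are skipped, avoiding protocol confounding.
```
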
Research Utility Snapshot (Detailed)

Top Metrics

  • Accuracy (8)
  • Agreement (1)
  • Cost (1)

Evaluation Modes

  • Automatic Metrics (8)

Top Benchmarks

  • BIRD (1)
  • CricBench (1)
  • DROP (1)
  • HLE (1)
