HFEPX Metric Hub

Accuracy & Pass Rate Metric Papers + Expert Verification

Updated from current HFEPX corpus (Mar 1, 2026). 10 papers are grouped in this metric page.

Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Multi Dim Rubric. Frequent quality control: Adjudication. Frequently cited benchmark: Ad-Bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 25, 2026.

Papers: 10 · Last published: Feb 25, 2026

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: Developing.

Metric Coverage

100.0%

10 sampled papers include metric names.

Benchmark Anchoring

30.0%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

30.0%

3 papers report calibration/adjudication/IAA controls.

None of the 10 papers in this sample is flagged as low-signal.
Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Primary action: Treat this as directional signal only; metric reporting is present but benchmark anchoring is still thin.

Why This Matters (Expanded)

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by expert verification.
Automatic metrics appear in 90% of papers in this hub (9/10).
Ad-Bench is listed as a benchmark anchor, but it appears in only 1 of 10 papers, so benchmark-matched cross-paper comparisons on this page are limited.

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

Quality-control signals are sparse: adjudication, gold questions, and inter-annotator agreement each appear in one paper (10%).
Rater context is mostly domain experts, and annotation is commonly multi-dimensional rubrics; use this to scope replication staffing.
Stratify by benchmark (Ad-Bench vs BIRD) before comparing methods.
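
The stratification step can be sketched in a few lines of Python. This is a minimal illustration, not part of the hub's tooling: the `PAPERS` list holds a handful of rows copied from the protocol matrix on this page, and the function name `stratify_by_benchmark` and the "Not reported" bucket are assumptions made for the example.

```python
from collections import defaultdict

# A few rows copied from the protocol matrix on this page:
# (short paper name, metrics, benchmarks). Names are shortened for readability.
PAPERS = [
    ("HLE-Verified", ["Accuracy"], ["HLE"]),
    ("CricBench", ["Accuracy"], ["DROP", "BIRD"]),
    ("AD-Bench", ["Pass@1", "Pass@3"], ["Ad Bench"]),
    ("APEX-Agents", ["Pass@1"], []),
]


def stratify_by_benchmark(papers):
    """Group paper names by benchmark.

    Papers with no benchmark anchor land in a 'Not reported' bucket
    (an assumption of this sketch) so they are not silently dropped.
    """
    groups = defaultdict(list)
    for name, _metrics, benchmarks in papers:
        for bench in benchmarks or ["Not reported"]:
            groups[bench].append(name)
    return dict(groups)


strata = stratify_by_benchmark(PAPERS)
for bench, names in strata.items():
    print(f"{bench}: {', '.join(names)}")
```

Only papers that fall in the same stratum are candidates for a direct method comparison; in this sample every benchmark stratum contains a single paper.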

Metric Interpretation

Accuracy is reported in 80% of hub papers (8/10); compare with a secondary metric before ranking methods.
Pass@1 is reported in 20% of hub papers (2/10); compare with a secondary metric before ranking methods.
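
For readers comparing Pass@1 with Pass@3, the sketch below shows one common way pass@k is estimated. The hub does not state how each paper computes it, so treat this as an assumption: it uses the widely cited unbiased estimator `1 - C(n - c, k) / C(n, k)` over `n` sampled attempts of which `c` pass. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which passed.

    Uses the unbiased estimator 1 - C(n - c, k) / C(n, k). This is an
    assumed definition; individual papers may compute pass@k differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimator reduces to the plain pass rate c / n.
p1 = pass_at_k(n=10, c=3, k=1)
p3 = pass_at_k(n=10, c=3, k=3)
print(f"pass@1 = {p1:.3f}, pass@3 = {p3:.3f}")
```

Because pass@k grows with k, a Pass@3 figure from one paper should not be ranked against a Pass@1 figure from another.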

Benchmark Context

Ad-Bench appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
BIRD appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Feb 15, 2026 · Citations: 0 · Score: 10.5
Metrics: Accuracy · Eval: Automatic Metrics

CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Dec 26, 2025 · Citations: 0 · Score: 10.0
Metrics: Accuracy · Eval: Automatic Metrics

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Feb 15, 2026 · Citations: 0 · Score: 9.0
Metrics: Pass@1, Pass@3 · Eval: Simulation Env

A Scalable Framework for Evaluating Health Language Models
Mar 30, 2025 · Citations: 0 · Score: 7.5
Metrics: Accuracy, Agreement · Eval: Automatic Metrics

MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Feb 25, 2026 · Citations: 0 · Score: 7.5
Metrics: Accuracy · Eval: Automatic Metrics

SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Feb 25, 2026 · Citations: 0 · Score: 7.5
Metrics: Accuracy · Eval: Automatic Metrics

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

Paper	Date	Metrics	Benchmarks	Eval Modes	Quality Controls
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam	Feb 15, 2026	Accuracy	HLE	Automatic Metrics	Adjudication
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics	Dec 26, 2025	Accuracy	DROP, BIRD	Automatic Metrics	Gold Questions
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents	Feb 15, 2026	Pass@1, Pass@3	Ad Bench	Simulation Env	Not reported
A Scalable Framework for Evaluating Health Language Models	Mar 30, 2025	Accuracy, Agreement	Not reported	Automatic Metrics	Inter Annotator Agreement Reported
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models	Feb 25, 2026	Accuracy	Not reported	Automatic Metrics	Not reported
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video	Feb 25, 2026	Accuracy	Not reported	Automatic Metrics	Not reported
What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform	Feb 19, 2026	Accuracy	Not reported	Automatic Metrics	Not reported
APEX-Agents	Jan 20, 2026	Pass@1	Not reported	Automatic Metrics	Not reported
MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation	Mar 23, 2025	Accuracy	Not reported	Automatic Metrics	Not reported
Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare	Feb 22, 2025	Accuracy	Not reported	Automatic Metrics	Not reported
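
The headline percentages on this page can be re-derived from the matrix with simple counting. The sketch below is a hedged illustration: the rows are transcribed by hand from the matrix, paper names are shortened, `None` stands for "Not reported", and the helper name `share` is an assumption of the example rather than anything the hub exposes.

```python
# Rows transcribed from the matrix above:
# (short name, metrics, benchmarks, quality control). None means "Not reported".
MATRIX = [
    ("HLE-Verified", ["Accuracy"], ["HLE"], "Adjudication"),
    ("CricBench", ["Accuracy"], ["DROP", "BIRD"], "Gold Questions"),
    ("AD-Bench", ["Pass@1", "Pass@3"], ["Ad Bench"], None),
    ("Health LM Framework", ["Accuracy", "Agreement"], [],
     "Inter Annotator Agreement Reported"),
    ("MEDSYN", ["Accuracy"], [], None),
    ("SurGo-R1", ["Accuracy"], [], None),
    ("Romanian Telemedicine", ["Accuracy"], [], None),
    ("APEX-Agents", ["Pass@1"], [], None),
    ("MedPlan", ["Accuracy"], [], None),
    ("Mental Healthcare Fairness", ["Accuracy"], [], None),
]


def share(rows, predicate):
    """Return the percentage of rows for which predicate(row) is true."""
    return 100.0 * sum(1 for row in rows if predicate(row)) / len(rows)


coverage = {
    "metric_coverage": share(MATRIX, lambda r: bool(r[1])),
    "benchmark_anchoring": share(MATRIX, lambda r: bool(r[2])),
    "quality_controls": share(MATRIX, lambda r: r[3] is not None),
    "accuracy": share(MATRIX, lambda r: "Accuracy" in r[1]),
    "pass_at_1": share(MATRIX, lambda r: "Pass@1" in r[1]),
}

for name, pct in coverage.items():
    print(f"{name}: {pct:.1f}%")
```

Recomputing the shares this way is a quick check that a filtered subset of the matrix (for example, medicine-only papers) still has enough benchmark-anchored rows to support a comparison.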

Researcher Workflow (Detailed)

Checklist

Strong: Papers with explicit human feedback
Coverage is strong (100% vs 45% target).

Strong: Papers reporting quality controls
Coverage is strong (30% vs 30% target).

Moderate: Papers naming benchmarks/datasets
Coverage is usable but incomplete (30% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (100% vs 35% target).

Strong: Papers with known rater population
Coverage is strong (100% vs 35% target).

Strong: Papers with known annotation unit
Coverage is strong (40% vs 35% target).
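
The checklist labels appear to follow a coverage-versus-target comparison. The sketch below reproduces them under one assumed rule, "Strong" when coverage meets or exceeds the target and "Moderate" otherwise. The hub does not document its actual thresholds (including any band below "Moderate"), so the `band` function is hypothetical.

```python
# (checklist item, coverage %, target %) as listed in the checklist above.
CHECKLIST = [
    ("Papers with explicit human feedback", 100, 45),
    ("Papers reporting quality controls", 30, 30),
    ("Papers naming benchmarks/datasets", 30, 35),
    ("Papers naming evaluation metrics", 100, 35),
    ("Papers with known rater population", 100, 35),
    ("Papers with known annotation unit", 40, 35),
]


def band(coverage: float, target: float) -> str:
    """Assumed rule: 'Strong' if coverage >= target, else 'Moderate'."""
    return "Strong" if coverage >= target else "Moderate"


labels = [band(cov, tgt) for _item, cov, tgt in CHECKLIST]
for (item, cov, tgt), label in zip(CHECKLIST, labels):
    print(f"{label}: {item} ({cov}% vs {tgt}% target)")
```

Under this rule the quality-controls item is "Strong" only because it sits exactly on its 30% target, which is worth keeping in mind when reading the label.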

Strengths

Strong human-feedback signal (100% of papers).
Quality-control evidence appears in 30% of papers.
Rater population is specified in 100% of papers and annotation unit in 40%.

Known Gaps

No dominant metadata gap detected in current extraction coverage.

Suggested Next Analyses

Stratify by benchmark (Ad-Bench vs BIRD) before comparing methods.
Track metric sensitivity by reporting both accuracy and pass@1.

Recommended Queries

Benchmark Slice: Ad-Bench
Metric Slice: accuracy
IAA-Reported Evaluations
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
Cross-page comparisons should be benchmark- and metric-matched to avoid protocol confounding.

Research Utility Snapshot (Detailed)

Top Metrics

Accuracy (8)
Pass@1 (2)
Agreement (1)
Cost (1)

Evaluation Modes

Automatic Metrics (9)
Simulation Env (1)

Top Benchmarks

Ad Bench (1)
BIRD (1)
Cricbench (1)
DROP (1)

Agentic Mix

Long Horizon (2)

Top Papers Reporting This Metric

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu · Feb 15, 2026 · Citations: 0
Simulation Env · Coding
While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li · Feb 15, 2026 · Citations: 0
Automatic Metrics · Law
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.

CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0
Automatic Metrics · Coding · Multilingual
To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.

A Scalable Framework for Evaluating Health Language Models
Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow · Mar 30, 2025 · Citations: 0
Automatic Metrics · Medicine
As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety.

APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0
Automatic Metrics · Law
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate…

Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare
Max Lamparth, Declan Grabb, Amy Franks, Scott Gershan, Kaitlyn N. Kunstman · Feb 22, 2025 · Citations: 0
Automatic Metrics · Medicine
Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions.

MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation
Hsin-Ling Hsu, Cong-Tinh Dao, Luning Wang, Zitao Shuai, Thao Nguyen Minh Phan · Mar 23, 2025 · Citations: 0
Automatic Metrics · Medicine
Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.

MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0
Automatic Metrics · Medicine · Coding
Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.

SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0
Automatic Metrics · Medicine · Coding
Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.

What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform
Adrian Cosma, Cosmin Dumitrache, Emilian Radoi · Feb 19, 2026 · Citations: 0
Automatic Metrics · Medicine
As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy.

Related Metric Hubs

Accuracy + Expert Verification Metric Papers
Expert Verification Papers
Expert Verification Or Rlaif Or Synthetic Feedback Papers
Expert Verification Or Rubric Rating Papers
Demonstrations Or Expert Verification Papers
Critique Edit Or Expert Verification Papers
Accuracy & Pass Rate Metric Papers (88)
Accuracy Metric Papers (82)
Accuracy & Pass Rate Metric Papers In CS.CL (63)
Accuracy & Pass Rate Metric Papers + Automatic Metrics (74)
Accuracy In CS.CL Papers (58)
Accuracy & Pass Rate Metric Papers In CS.AI (58)
Accuracy + Automatic Metrics Metric Papers (70)
Accuracy & Pass Rate Metric Papers + General (42)
Accuracy + General Metric Papers (40)
Accuracy & Pass Rate Metric Papers + Long Horizon (30)
Accuracy + Automatic Metrics Metric Papers (Last 120 Days) (53)
Accuracy + Automatic Metrics Metric Papers (Last 90 Days) (51)
Accuracy + Automatic Metrics Metric Papers (Last 30 Days) (47)
Accuracy + Automatic Metrics Metric Papers (Last 45 Days) (47)
Accuracy + Automatic Metrics Metric Papers (Last 60 Days) (47)
Accuracy + General Metric Papers (Last 120 Days) (31)
Accuracy + General Metric Papers (Last 90 Days) (29)
Accuracy + General Metric Papers (Last 30 Days) (27)
