HFEPX Hub

Automatic Metrics + Medicine (Last 60 Days)

Updated from current HFEPX corpus (Mar 1, 2026). 12 papers are grouped in this hub page.

Common evaluation modes: Automatic Metrics, LLM-as-Judge. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Adjudication. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 12 · Last published: Feb 26, 2026

Automatic Metrics · Medicine · Last 60d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage: 100.0% (12 / 12 sampled papers are not low-signal flagged).
Replication-Ready Set: papers with benchmark + metric + eval mode explicitly present.
Judge/Human Comparability: papers containing both `human_eval` and `llm_as_judge`.

0 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
2 papers report explicit quality controls (calibration/adjudication/IAA).
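
The three counts above are simple filters over per-paper protocol fields. The following is a minimal sketch of those filters, assuming a hypothetical `Paper` record; the field names and the shortened titles are illustrative and are not the hub's actual schema. The sample uses three rows from the Protocol Matrix on this page, not the full set of 12.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical record; the hub does not publish its schema.
    title: str
    eval_modes: set
    metrics: set = field(default_factory=set)
    benchmarks: set = field(default_factory=set)
    quality_controls: set = field(default_factory=set)


def is_replication_ready(p: Paper) -> bool:
    """Benchmark + metric + explicit evaluation mode all present."""
    return bool(p.benchmarks) and bool(p.metrics) and bool(p.eval_modes)


def supports_judge_vs_human(p: Paper) -> bool:
    """Paper contains both human_eval and llm_as_judge."""
    return {"human_eval", "llm_as_judge"} <= p.eval_modes


def reports_quality_controls(p: Paper) -> bool:
    """Any explicit quality control (calibration, adjudication, IAA, ...)."""
    return bool(p.quality_controls)


# Three rows from the Protocol Matrix (titles shortened for illustration).
papers = [
    Paper("Rare disease phenotyping", {"automatic_metrics"},
          {"f1", "precision"}, quality_controls={"gold_questions"}),
    Paper("Immutable inference snapshots", {"automatic_metrics"},
          {"agreement"}, quality_controls={"adjudication"}),
    Paper("Medical QA: zero-shot and LLM-as-a-judge",
          {"llm_as_judge", "automatic_metrics"}, {"bleu", "rouge"}),
]

replication_ready = [p.title for p in papers if is_replication_ready(p)]
judge_vs_human = [p.title for p in papers if supports_judge_vs_human(p)]
with_qc = [p.title for p in papers if reports_quality_controls(p)]

print(len(replication_ready), len(judge_vs_human), len(with_qc))
```

None of the sampled rows names a benchmark or includes a `human_eval` mode, which is why the first two filters come back empty while the quality-control filter matches the two papers that report gold questions or adjudication.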

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Why This Matters For Eval Research

58.3% of papers report explicit human-feedback signals, led by expert verification.
Automatic metrics appear in 100% of papers in this hub.
Long-horizon tasks appear in 16.7% of papers, indicating agentic evaluation demand.

Protocol Notes

Protocol Takeaways

Quality controls are rare: adjudication and gold questions each appear in one paper (8.3% each).
Raters are mostly domain experts and the most common annotation unit is ranking; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Metric Interpretation

Accuracy is reported in 50% of hub papers (6/12); compare with a secondary metric before ranking methods.
Agreement is reported in 16.7% of hub papers (2/12); compare with a secondary metric before ranking methods.

Researcher Checklist

Strong: Papers with explicit human feedback. Coverage is strong (58.3% vs 45% target).
Gap: Papers reporting quality controls. Coverage is a replication risk (16.7% vs 30% target).
Gap: Papers naming benchmarks/datasets. Coverage is a replication risk (0% vs 35% target).
Strong: Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Strong: Papers with known rater population. Coverage is strong (58.3% vs 35% target).
Moderate: Papers with known annotation unit. Coverage is usable but incomplete (25% vs 35% target).
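
The checklist labels compare an observed coverage percentage against a target. The page does not state the rule that separates "Moderate" from "Gap", so the sketch below assumes a cutoff at two-thirds of the target. That cutoff and the function name `coverage_band` are assumptions, chosen only because they reproduce the six labels shown here.

```python
def coverage_band(observed_pct: float, target_pct: float,
                  moderate_ratio: float = 2 / 3) -> str:
    """Label coverage relative to a target.

    moderate_ratio is an assumed cutoff, not a documented hub rule.
    """
    if observed_pct >= target_pct:
        return "Strong"
    if observed_pct >= moderate_ratio * target_pct:
        return "Moderate"
    return "Gap"


# (check, observed %, target %) as listed in the checklist.
checklist = [
    ("explicit human feedback", 58.3, 45),
    ("quality controls", 16.7, 30),
    ("benchmarks/datasets", 0.0, 35),
    ("evaluation metrics", 100.0, 35),
    ("known rater population", 58.3, 35),
    ("known annotation unit", 25.0, 35),
]

bands = {name: coverage_band(obs, target) for name, obs, target in checklist}
for name, band in bands.items():
    print(f"{band}: {name}")
```

Any cutoff between roughly 56% and 71% of the target would yield the same labels for these six rows, so treat the exact ratio as a placeholder.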

Strengths

Strong human-feedback signal (58.3% of papers).
Agentic evaluation appears in 33.3% of papers.

Known Gaps

Only 16.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Benchmark coverage is thin (0% of papers mention benchmarks/datasets).
LLM-as-judge appears without enough inter-annotator agreement reporting.

Suggested Next Analyses

Track metric sensitivity by reporting both accuracy and agreement.
Add inter-annotator agreement checks when reproducing these protocols.
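
For the inter-annotator agreement check, one common choice for two raters is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a plain-Python illustration; the hub does not prescribe a specific agreement statistic, and the labels are made-up example data, not results from any paper listed here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only (hypothetical accept/reject judgments).
rater_a = ["y", "y", "n", "n", "y", "n", "y", "y", "n", "n"]
rater_b = ["y", "y", "n", "y", "y", "n", "y", "n", "n", "n"]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))  # 0.6
```

Here the raters agree on 8 of 10 items (0.8) while chance agreement is 0.5, giving kappa = (0.8 - 0.5) / (1 - 0.5) = 0.6. Reporting kappa alongside raw accuracy makes it easier to see when high agreement is driven by a skewed label distribution.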

Recommended Queries

LLM-as-Judge Protocols · Metric Slice: accuracy · Recent High-Signal Papers

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Feb 23, 2026 · Citations: 0 · Score: 7.5
HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: F1

Modeling Expert AI Diagnostic Alignment via Immutable Inference Snapshots
Feb 26, 2026 · Citations: 0 · Score: 7.5
HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Agreement

Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Feb 17, 2026 · Citations: 0 · Score: 6.0
HF: Pairwise Preference, Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Agreement

CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Feb 20, 2026 · Citations: 0 · Score: 6.0
HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Precision

MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Feb 25, 2026 · Citations: 0 · Score: 6.0
HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Feb 25, 2026 · Citations: 0 · Score: 6.0
HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	Date	HF Signal	Eval Modes	Benchmarks	Metrics	QC
An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models	Feb 23, 2026	Yes (Expert Verification)	Automatic Metrics	Not Reported	F1, Precision	Gold Questions
Modeling Expert AI Diagnostic Alignment via Immutable Inference Snapshots	Feb 26, 2026	Yes (Expert Verification)	Automatic Metrics	Not Reported	Agreement	Adjudication
Multi-Objective Alignment of Language Models for Personalized Psychotherapy	Feb 17, 2026	Yes (Pairwise Preference, Expert Verification)	Automatic Metrics	Not Reported	Agreement, Cost	Not Reported
CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications	Feb 20, 2026	Yes (Expert Verification)	Automatic Metrics	Not Reported	Precision, Recall	Not Reported
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models	Feb 25, 2026	Yes (Expert Verification)	Automatic Metrics	Not Reported	Accuracy	Not Reported
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video	Feb 25, 2026	Yes (Expert Verification)	Automatic Metrics	Not Reported	Accuracy	Not Reported
What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform	Feb 19, 2026	Yes (Expert Verification)	Automatic Metrics	Not Reported	Accuracy	Not Reported
Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics	Feb 23, 2026	No (Not Reported)	Automatic Metrics	Not Reported	Accuracy, Cost	Not Reported
AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG	Feb 22, 2026	No (Not Reported)	Automatic Metrics	Not Reported	Accuracy	Not Reported
A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing	Feb 15, 2026	No (Not Reported)	Automatic Metrics	Not Reported	Accuracy, BLEU	Not Reported
Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation	Feb 16, 2026	No (Not Reported)	LLM-as-Judge, Automatic Metrics	Not Reported	BLEU, ROUGE	Not Reported
INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection	Jan 28, 2026	No (Not Reported)	Automatic Metrics	Not Reported	Latency	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	An artificial intelligence framework for end-to-end…	Modeling Expert AI Diagnostic Alignment via Immutab…	Multi-Objective Alignment of Language Models for Pe…
Human Feedback	Expert Verification	Expert Verification	Pairwise Preference, Expert Verification
Evaluation Modes	Automatic Metrics	Automatic Metrics	Automatic Metrics
Benchmarks	Not reported	Not reported	Not reported
Metrics	F1, Precision	Agreement	Agreement, Cost
Quality Controls	Gold Questions	Adjudication	Not reported
Rater Population	Domain Experts	Domain Experts	Domain Experts
Annotation Unit	Ranking	Unknown	Ranking

Research Utility Snapshot

Human Feedback Mix

Expert Verification (7)
Pairwise Preference (1)

Evaluation Modes

Automatic Metrics (12)
LLM-as-Judge (1)

Top Benchmarks

None reported.

Top Metrics

Accuracy (6)
Agreement (2)
BLEU (2)
Cost (2)

Rater Population Mix

Domain Experts (7)

Quality Controls

Adjudication (1)
Gold Questions (1)

Coverage diagnostics (sample-based): human-feedback 58.3% · benchmarks 0.0% · metrics 100.0% · quality controls 16.7%.
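
The diagnostics line can be reproduced from the per-category paper counts on this page (7 papers with a human-feedback signal, 0 naming a benchmark, 12 naming a metric, 2 reporting a quality control, out of 12). A minimal sketch of that arithmetic follows; the helper name `coverage_pct` is illustrative.

```python
def coverage_pct(count: int, total: int) -> float:
    """Share of papers with a given signal, as a percentage to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


# Counts read off the Protocol Matrix and Research Utility Snapshot.
total_papers = 12
counts = {
    "human-feedback": 7,
    "benchmarks": 0,
    "metrics": 12,
    "quality controls": 2,
}

summary = " · ".join(
    f"{name} {coverage_pct(count, total_papers):.1f}%"
    for name, count in counts.items()
)
print(summary)
```

With only 12 papers, each paper moves a percentage by about 8.3 points, so small differences between categories should not be over-read.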

Top Papers

An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram · Feb 23, 2026 · Citations: 0
Expert Verification · Automatic Metrics
Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype…

Modeling Expert AI Diagnostic Alignment via Immutable Inference Snapshots
Dimitrios P. Panagoulias, Evangelia-Aikaterini Tsichrintzi, Georgios Savvidis, Evridiki Tsoureli-Nikita · Feb 26, 2026 · Citations: 0
Expert Verification · Automatic Metrics
Human-in-the-loop validation is essential in safety-critical clinical AI, yet the transition between initial model inference and expert correction is rarely analyzed as a structured signal.

Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli · Feb 17, 2026 · Citations: 0
Pairwise Preference · Expert Verification · Automatic Metrics
While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.

CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego · Feb 20, 2026 · Citations: 0
Expert Verification · Automatic Metrics
The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets.

MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0
Expert Verification · Automatic Metrics
Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.

SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0
Expert Verification · Automatic Metrics
Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.

What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform
Adrian Cosma, Cosmin Dumitrache, Emilian Radoi · Feb 19, 2026 · Citations: 0
Expert Verification · Automatic Metrics
As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy.

Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics
Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li · Feb 23, 2026 · Citations: 0
Automatic Metrics · Long Horizon
The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings.

AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
Qijie You, Wenkai Yu, Wentao Zhang · Feb 22, 2026 · Citations: 0
Automatic Metrics · Long Horizon
With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction.

A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing
Naeimeh Nourmohammadi, Md Meem Hossain, The Anh Han, Safina Showkat Ara, Zia Ush Shamszaman · Feb 15, 2026 · Citations: 0
Automatic Metrics · Multi Agent
We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability.

Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation
Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Tareque Mohmud Chowdhury · Feb 16, 2026 · Citations: 0
LLM-as-Judge · Automatic Metrics
We are using a zero-shot evaluation methodology and using BLEU and ROUGE metrics to evaluate performance without specialized fine-tuning.

INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection
Shubham Kulkarni, Alexander Lyzhov, Preetam Joshi, Shiva Chaitanya · Jan 28, 2026 · Citations: 0
Automatic Metrics · Web Browsing
We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification.
