HFEPX Hub

Automatic Metrics + Expert Verification Papers

Updated from current HFEPX corpus (Feb 27, 2026). 19 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Gold Questions. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 25, 2026.

Papers: 19 · Last published: Feb 25, 2026

Automatic Metrics Expert Verification

Research Narrative

Grounded narrative Model: deterministic-grounded

Updated from current HFEPX corpus (Feb 27, 2026). This page covers 19 papers that combine Automatic Metrics with Expert Verification. Automatic Metrics is the common evaluation mode, and the named benchmarks are Retrieval and BIRD. Reported metrics concentrate on accuracy and cost, and the agentic tags that appear are Multi Agent and Tool Use. Use the anchored takeaways below to compare protocol choices, quality-control patterns, and evidence depth before allocating new eval budget.

Why This Matters For Eval Research

Evaluation emphasis: Automatic Metrics appear frequently in this slice.

Evidence: MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models; SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video; SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
Benchmark concentration: the named benchmarks are Retrieval and BIRD; concentrating on them helps control cross-paper variance.

Evidence: SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video; SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery; "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Metric concentration: accuracy and cost are repeatedly reported in this group.

Evidence: SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery; "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems; An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems

Protocol Takeaways

Stratify by benchmark (Retrieval vs BIRD) before comparing methods.

Evidence: "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems; An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems; An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Track metric sensitivity by reporting both accuracy and cost.

Evidence: An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems; An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models; Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation
Explicit human feedback appears in approximately 100% of papers in this set.

Evidence: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models; Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation; CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
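
The first two takeaways (stratify by benchmark, report accuracy and cost together) can be sketched roughly as follows. This is a minimal illustration, not a method taken from any listed paper: the record layout, the method names "A" and "B", and every numeric value are hypothetical placeholders.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical placeholder records; none of these values come from the listed papers.
results = [
    {"method": "A", "benchmark": "Retrieval", "accuracy": 0.80, "cost": 1.0},
    {"method": "B", "benchmark": "Retrieval", "accuracy": 0.70, "cost": 0.5},
    {"method": "A", "benchmark": "BIRD", "accuracy": 0.60, "cost": 2.0},
    {"method": "B", "benchmark": "BIRD", "accuracy": 0.65, "cost": 1.5},
]


def stratify(records):
    """Group records by benchmark, then by method, and average accuracy and cost per cell."""
    cells = defaultdict(lambda: defaultdict(list))
    for record in records:
        cells[record["benchmark"]][record["method"]].append(record)

    summary = {}
    for benchmark, methods in cells.items():
        summary[benchmark] = {
            method: {
                "accuracy": mean(row["accuracy"] for row in rows),
                "cost": mean(row["cost"] for row in rows),
            }
            for method, rows in methods.items()
        }
    return summary


summary = stratify(results)

# Methods are only compared inside one benchmark stratum, never across strata.
for benchmark, methods in summary.items():
    for method, stats in sorted(methods.items()):
        print(f"{benchmark:10s} {method}: accuracy={stats['accuracy']:.2f} cost={stats['cost']:.2f}")
```

Keeping accuracy and cost in the same cell makes it harder to rank a method on accuracy alone when its cost differs sharply within the same benchmark.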

Benchmark Interpretation

Retrieval appears as a recurring benchmark anchor in this page.
2 papers (10.5%) mention Retrieval.
Most common evaluation modes: Automatic Metrics.

Metric Interpretation

accuracy is a commonly reported metric; pair it with protocol context before ranking methods.
8 papers (42.1%) mention accuracy.
Most common evaluation modes: Automatic Metrics.

Researcher Checklist

Papers with explicit human feedback: Coverage is strong (100% vs 45% target).
Papers reporting quality controls: Coverage is usable but incomplete (21.1% vs 30% target).
Papers naming benchmarks/datasets: Coverage is usable but incomplete (31.6% vs 35% target).
Papers naming evaluation metrics: Coverage is strong (84.2% vs 35% target).
Papers with known rater population: Coverage is strong (100% vs 35% target).
Papers with known annotation unit: Coverage is usable but incomplete (26.3% vs 35% target).
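
The coverage figures above can be reproduced with a short calculation. The sketch below rests on two assumptions: the per-field paper counts are back-computed from the reported percentages (the page lists only percentages and the total of 19 papers), and the labeling rule ("strong" at or above target, otherwise "usable but incomplete") is inferred from the six rows rather than documented.

```python
TOTAL_PAPERS = 19  # stated on the page

# (field, paper count, target %). Counts are assumptions back-computed from the
# reported percentages; the page itself only gives the percentages.
CHECKLIST = [
    ("explicit human feedback", 19, 45),
    ("quality controls", 4, 30),
    ("benchmarks/datasets", 6, 35),
    ("evaluation metrics", 16, 35),
    ("known rater population", 19, 35),
    ("known annotation unit", 5, 35),
]


def coverage(count, total=TOTAL_PAPERS):
    """Share of papers as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


def label(pct, target):
    # Assumed rule inferred from the checklist rows, not a documented threshold.
    return "strong" if pct >= target else "usable but incomplete"


report = [(name, coverage(count), target, label(coverage(count), target))
          for name, count, target in CHECKLIST]

for name, pct, target, verdict in report:
    print(f"{name}: {pct}% vs {target}% target -> {verdict}")
```

The same `coverage` helper applies to the interpretation sections: 2 of 19 papers mentioning Retrieval and 8 of 19 mentioning accuracy round to the 10.5% and 42.1% reported there.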

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper method details may be missing.
Extraction fields are conservative and can under-report implicit protocol details.
Cross-page comparisons should control for benchmark and metric mismatch.

Research Utility Links

Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Top Papers

MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0

Expert Verification Automatic Metrics

Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0

Expert Verification Automatic Metrics

Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026 · Citations: 0

Expert Verification Automatic Metrics Multi Agent

Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
"Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng · Feb 24, 2026 · Citations: 0

Expert Verification Automatic Metrics

Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.
An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems
Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, Harmanpreet Kaur · Feb 24, 2026 · Citations: 0

Expert Verification Automatic Metrics

Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practic
An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram · Feb 23, 2026 · Citations: 0

Expert Verification Automatic Metrics

Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype Ontolo
Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation
Rizhuo Huang, Yifan Feng, Rundong Xue, Shihui Ying, Jun-Hai Yong · Feb 23, 2026 · Citations: 0

Expert Verification Automatic Metrics

Additionally, we present HyperDocRED, a rigorously annotated benchmark for document-level knowledge hypergraph extraction.
CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego · Feb 20, 2026 · Citations: 0

Expert Verification Automatic Metrics

The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets.
What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform
Adrian Cosma, Cosmin Dumitrache, Emilian Radoi · Feb 19, 2026 · Citations: 0

Expert Verification Automatic Metrics

As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy.
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026 · Citations: 0

Expert Verification Automatic Metrics Multi Agent

Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.
Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli · Feb 17, 2026 · Citations: 0

Pairwise Preference Expert Verification Automatic Metrics

While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li · Feb 15, 2026 · Citations: 0

Expert Verification Critique Edit Automatic Metrics

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0

Expert Verification Automatic Metrics

To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise
Nitin Sharma, Thomas Wolfers, Çağatay Yıldız · Jun 9, 2025 · Citations: 0

Expert Verification Automatic Metrics

Accurate domain-specific benchmarking of LLMs is essential, specifically in domains with direct implications for humans, such as law, healthcare, and education.
HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang · Jun 4, 2025 · Citations: 0

Expert Verification Automatic Metrics

However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences
A Scalable Framework for Evaluating Health Language Models
Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow · Mar 30, 2025 · Citations: 0

Rubric Rating Expert Verification Automatic Metrics

As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety.
MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation
Hsin-Ling Hsu, Cong-Tinh Dao, Luning Wang, Zitao Shuai, Thao Nguyen Minh Phan · Mar 23, 2025 · Citations: 0

Expert Verification Automatic Metrics

Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.
Measuring AI Ability to Complete Long Software Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia · Mar 18, 2025 · Citations: 0

Expert Verification Automatic Metrics Tool Use

Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare
Max Lamparth, Declan Grabb, Amy Franks, Scott Gershan, Kaitlyn N. Kunstman · Feb 22, 2025 · Citations: 0

Pairwise Preference Expert Verification Automatic Metrics

Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions.