Shuang Zhou, Kai Yu, Song Wang, Wenya Xie, Zaifu Zhan, Meng-Han Tsai · Mar 11, 2026 · Citations: 0
Expert Verification · Automatic Metrics · Medicine
Here we present HeartAgent, a cardiology-specific agent system designed to support reliable and explainable differential diagnosis.
Evaluated on the MIMIC dataset and a private electronic health records cohort, HeartAgent achieved improvements of over 36% and 20%, respectively, in top-3 diagnostic accuracy over established comparative methods.
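As a quick illustration of the top-3 diagnostic accuracy metric reported here, a minimal sketch on toy data (the ranked lists and gold diagnoses below are invented, not HeartAgent's evaluation):

```python
def top_k_accuracy(ranked_diagnoses, gold, k=3):
    """Fraction of cases whose gold diagnosis appears in the model's top-k list."""
    hits = sum(g in preds[:k] for preds, g in zip(ranked_diagnoses, gold))
    return hits / len(gold)

# Toy data: the gold diagnosis is in the top 3 for two of three cases.
preds = [["MI", "angina", "PE"], ["asthma", "COPD", "CHF"], ["sepsis", "UTI", "influenza"]]
gold = ["PE", "CHF", "pneumonia"]
print(top_k_accuracy(preds, gold, k=3))  # ~0.667
```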
Ruiyang Ren, Yuhao Wang, Yunsen Liang, Lan Luo, Jing Liu, Haifeng Wang · Mar 11, 2026 · Citations: 0
Expert Verification · Automatic Metrics · Medicine
We developed DxEvolve, a self-evolving diagnostic agent that bridges these gaps through an interactive deep clinical research workflow.
On the MIMIC-CDM benchmark, DxEvolve improved diagnostic accuracy by 11.2% on average over backbone models and reached 90.4% on a reader-study subset, comparable to the clinician reference (88.8%).
Weixin Liu, Congning Ni, Qingyuan Song, Susannah L. Rose, Christopher Symons, Murat Kantarcioglu · Mar 11, 2026 · Citations: 0
Pairwise Preference · LLM as Judge · Medicine
We introduce VERI-DPO, which uses claim verification to mine preferences and distill them into the summarizer with Direct Preference Optimization (DPO).
On held-out patients, verifier-mined preferences separate candidates by contradiction density, and VERI-DPO reduces Not Supported claim rates from 10.7% to 1.9% (local verifier judge) and from 11.6% to 6.4% (GPT-4o judge), while improving…
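The mining step can be pictured as follows. This is a hedged sketch, assuming a claim verifier that labels each extracted claim "Supported" or "Not Supported"; the function names and margin heuristic are illustrative, not the paper's API:

```python
def contradiction_density(verdicts):
    """Fraction of a candidate summary's claims marked Not Supported by the verifier."""
    return sum(v == "Not Supported" for v in verdicts) / max(len(verdicts), 1)

def mine_dpo_pair(candidates, margin=0.1):
    """candidates: list of (summary_text, claim_verdicts). Returns one chosen/rejected pair."""
    scored = sorted((contradiction_density(v), s) for s, v in candidates)
    (d_best, chosen), (d_worst, rejected) = scored[0], scored[-1]
    if d_worst - d_best < margin:      # verifier cannot clearly separate these candidates
        return None
    return {"chosen": chosen, "rejected": rejected}  # fed to a standard DPO trainer
```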
Zhongzhen Huang, Yan Ling, Hong Chen, Ye Feng, Li Wu, Linjie Mu · Mar 11, 2026 · Citations: 0
Expert Verification · Automatic Metrics · Medicine
We present PULSE, a medical reasoning agent that combines a domain-tuned large language model with scientific literature retrieval to support diagnostic decision-making in complex real-world cases.
To evaluate its capabilities, we curated a benchmark of 82 authentic endocrinology case reports encompassing a broad spectrum of disease types and incidence levels.
We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses.
Across 7 benchmarks and 5 task models, PEEM's accuracy axis aligns strongly with conventional accuracy while preserving model rankings (aggregate Spearman ρ ≈ 0.97, Pearson r ≈ 0.94, p < 0.001).
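The reported agreement can be reproduced in form (not in substance) with standard correlation calls; the per-model scores below are invented for illustration:

```python
from scipy.stats import pearsonr, spearmanr

peem_accuracy_axis = [0.71, 0.64, 0.82, 0.58, 0.77]  # hypothetical per-model PEEM scores
conventional_acc   = [0.69, 0.61, 0.85, 0.55, 0.74]  # hypothetical benchmark accuracies

rho, p_rho = spearmanr(peem_accuracy_axis, conventional_acc)  # rank agreement: rankings preserved
r, p_r = pearsonr(peem_accuracy_axis, conventional_acc)       # linear agreement
print(f"Spearman rho={rho:.2f} (p={p_rho:.3g}), Pearson r={r:.2f} (p={p_r:.3g})")
```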
Eeham Khan, Luis Rodriguez, Marc Queudot · Mar 10, 2026 · Citations: 0
Demonstrations · Automatic Metrics · Medicine
We evaluate this framework on the BioASQ and PubMedQA benchmarks, specifically analyzing the impact of dynamic in-context learning and reranking under constrained token budgets.
Additionally, we perform a pilot study combining human expert assessment with LLM-based verification to explore how explicit rationale generation improves system transparency and enables more detailed diagnosis of retrieval failures in…
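A minimal sketch of the demonstration-selection step under a token budget, as analyzed above; the word-overlap relevance scorer and whitespace token counter are crude stand-ins, not the paper's reranker:

```python
def select_demos(query, pool, relevance, count_tokens, budget=2048):
    """Greedily keep the highest-ranked demonstrations that still fit the token budget."""
    ranked = sorted(pool, key=lambda d: relevance(query, d), reverse=True)  # reranking step
    chosen, used = [], 0
    for demo in ranked:
        cost = count_tokens(demo)
        if used + cost <= budget:
            chosen.append(demo)
            used += cost
    return chosen

# Toy usage with assumed relevance and tokenization; only the first demo fits the budget.
pool = ["Q: fever and cough? A: consider pneumonia.", "Q: chest pain? A: rule out MI."]
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
print(select_demos("patient with fever", pool, overlap, lambda d: len(d.split()), budget=12))
```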
Yunhang Qian, Xiaobin Hu, Jiaquan Yu, Siyang Xin, Xiaokun Chen, Jiangning Zhang · Mar 10, 2026 · Citations: 0
Automatic Metrics · Medicine · Coding
While Multi-Agent Systems (MAS) show potential for complex clinical decision support, the field remains hindered by architectural fragmentation and the lack of standardized multimodal integration.
To address these challenges, we present MedMASLab, a unified framework and benchmarking platform for multimodal medical multi-agent systems.
Seunghwan Kim, Tiffany H. Kung, Heena Verma, Dilan Edirisinghe, Kaveh Sedehi, Johanna Alvarez · Mar 10, 2026 · Citations: 0
Expert Verification · Automatic Metrics · Medicine
Results: Against a human majority-vote standard (N=467), the agent achieved 95.8% emergency sensitivity and 88.5% sensitivity for all actionable alerts (85.7% specificity).
In LOO analysis, the agent outperformed every clinician in emergency sensitivity (97.5% vs. …
Translating these systems into clinical practice requires assessment in real-world workflows with rigorous safety oversight.
We sought to assess the agent's conversational safety and quality, patient and clinician experience, and clinical reasoning capabilities compared with primary care providers (PCPs).
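For reference, the sensitivity and specificity figures above reduce to simple confusion-matrix ratios; the toy arrays below are illustrative, not study data:

```python
def sensitivity_specificity(y_true, y_pred):
    """y_true: reference-standard alert labels; y_pred: the agent's alert decisions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))
    fp = sum(not t and p for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

sens, spec = sensitivity_specificity([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # 0.75, 0.75 on this toy data
```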
Zhijun Wang, Ling Luo, Dinghao Pan, Huan Zhuang, Lejing Yu, Yuanyuan Sun · Mar 9, 2026 · Citations: 0
Automatic Metrics · Medicine · Coding
First, a multi-agent collaborative mechanism is utilized to automatically generate high-quality expert-like reasoning traces for supervised fine-tuning.
Additional evaluation on the DDI13 corpus confirms its generalizability to binary drug-drug interaction tasks.
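The first stage can be sketched roughly as a propose-critique loop whose accepted traces become SFT examples; `drafter` and `critic` are stand-in callables, not the paper's implementation:

```python
def generate_sft_traces(cases, drafter, critic, threshold=0.8, max_rounds=3):
    """Collect expert-like reasoning traces for supervised fine-tuning."""
    dataset = []
    for case in cases:
        feedback = None
        for _ in range(max_rounds):
            trace = drafter(case, feedback)         # agent 1 proposes or revises a trace
            score, feedback = critic(case, trace)   # agent 2 scores and critiques it
            if score >= threshold:                  # accepted traces become SFT targets
                dataset.append({"input": case, "target": trace})
                break
    return dataset
```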
Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Benoit Favre, Richard Dufour · Mar 6, 2026 · Citations: 0
Expert Verification · LLM as Judge · Medicine
Evaluation on open-ended QA combines automatic metrics, LLM-as-a-judge assessment, and human expert review; although LLM-based judgments correlate best with human ratings, they show sensitivity to verbosity.
Mohammed Baharoon, Thibault Heintz, Siavash Raissi, Mahmoud Alabbad, Mona Alhammad, Hassan AlOmaish · Mar 6, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics · Medicine
We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety.
CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's tau = 0.61-0.71; Pearson's r = 0.71-0.84), and through two additional benchmarks that we…
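The form of this validation is a rank correlation between metric scores and annotated error counts; the numbers below are invented, not ReXVal annotations:

```python
from scipy.stats import kendalltau, pearsonr

error_counts = [0, 1, 1, 2, 3, 5, 4]                       # hypothetical errors per report
metric_score = [0.95, 0.88, 0.90, 0.75, 0.60, 0.35, 0.42]  # hypothetical CRIMSON-style scores

tau, _ = kendalltau(error_counts, metric_score)
r, _ = pearsonr(error_counts, metric_score)
print(f"Kendall tau={tau:.2f}, Pearson r={r:.2f}")  # strongly negative: more errors, lower score
```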
Guoyi Li, Shihao Xu, Jiatong Ma, Yunyun Han, Jianhua Chen, Yafeng Deng · Mar 4, 2026 · Citations: 0
Rubric Rating · Automatic Metrics · Medicine
Large language models (LLMs) have advanced medical dialogue systems, yet psychiatric consultation poses substantially higher demands due to subjective ambiguity and comorbidity complexity: an agent must continuously extract…
To avoid costly clinician labeling, we introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations.
We evaluate PrivMedChat across medical dialogue tasks and assess utility, safety, and privacy under consistent privacy accounting, thereby providing a practical pathway to align medical chatbots while offering formal privacy guarantees.
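A hedged sketch of the annotation-free pairing described above, assuming each dialogue record carries a physician response and several non-expert generations; `passes_filter` stands in for the paper's filtering step:

```python
def build_preference_pairs(dialogues, passes_filter):
    """Pair physician responses (chosen) with filtered non-expert generations (rejected)."""
    pairs = []
    for d in dialogues:  # d: {"prompt": ..., "physician": ..., "non_expert": [...]}
        for gen in d["non_expert"]:
            if passes_filter(gen):  # e.g. drop degenerate or off-topic generations
                pairs.append({"prompt": d["prompt"],
                              "chosen": d["physician"],
                              "rejected": gen})
    return pairs
```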
Yichi Zhang, Nabeel Seedat, Yinpeng Dong, Peng Cui, Jun Zhu, Mihaela van der Schaar · Mar 3, 2026 · Citations: 0
Expert Verification · Automatic Metrics · Medicine
As LLM-powered agents have been used for high-stakes decision-making, such as clinical diagnosis, it becomes critical to develop reliable verification of their decisions to facilitate trustworthy deployment.
We empirically validate GLEAN with agentic clinical diagnosis across three diseases from the MIMIC-IV dataset, surpassing the best baseline by 12% in AUROC and by 50% in Brier score reduction, confirming its effectiveness in both…
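Both headline metrics are standard; a minimal sketch on toy verifier confidences (not GLEAN's outputs):

```python
from sklearn.metrics import brier_score_loss, roc_auc_score

y_correct  = [1, 0, 1, 1, 0, 0, 1]                  # whether each diagnosis was right
confidence = [0.9, 0.4, 0.8, 0.7, 0.3, 0.6, 0.85]   # verifier confidence per decision

print("AUROC:", roc_auc_score(y_correct, confidence))     # discrimination (higher is better)
print("Brier:", brier_score_loss(y_correct, confidence))  # calibration (lower is better)
```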
Minseok Choi, Dongjin Kim, Seungbin Yang, Subin Kim, Youngjun Kwak, Juyoung Oh · Mar 3, 2026 · Citations: 0
Expert Verification · Law · Medicine
With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies.
Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial…
Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan · Mar 2, 2026 · Citations: 0
Rubric Rating · LLM as Judge · Medicine
However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows.
We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts.
Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva, Jesse J Hubbard · Mar 2, 2026 · Citations: 0
Rubric Rating · Expert Verification · LLM as Judge · Automatic Metrics · Medicine
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety.
We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
Bian Sun, Zhenjian Wang, Orvill de la Torre, Zirui Wang · Feb 27, 2026 · Citations: 0
LLM as Judge · Automatic Metrics · Medicine
Due to the resource-intensive nature of large-scale human validation, the model's performance was evaluated through a dual-track framework: Track A utilized traditional lexical similarity metrics (e.g., BLEU, ROUGE), while Track B employed…
Consequently, we propose that while automated metrics and LLM judges serve as valuable developmental proxies, rigorous validation by human medical experts remains an indispensable requirement for the safe deployment of LLMs in healthcare…
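The dual-track shape is easy to sketch; the unigram F1 below is a crude stand-in for Track A's lexical metrics, and `llm_judge` is a stubbed Track B call (both are assumptions for illustration, not the paper's pipeline):

```python
def lexical_f1(reference, candidate):
    """Unigram-overlap F1, a rough proxy for BLEU/ROUGE-style Track A scoring."""
    ref, cand = set(reference.split()), set(candidate.split())
    common = len(ref & cand)
    if common == 0:
        return 0.0
    p, r = common / len(cand), common / len(ref)
    return 2 * p * r / (p + r)

def evaluate(reference, candidate, llm_judge):
    return {
        "track_a_lexical": lexical_f1(reference, candidate),
        "track_b_judge": llm_judge(reference, candidate),  # e.g. a rubric score from an LLM
    }
```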