- PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Yiqing Zhang, Xiaozhong Liu, Fabricio Murai · Mar 28, 2026 · Citations: 0
Expert Verification Llm As JudgeAutomatic Metrics
In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
- Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning
Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner · Mar 29, 2026 · Citations: 0
Expert Verification Human EvalAutomatic Metrics Multi Agent
In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded.
- SODIUM: From Open Web Data to Queryable Databases
Chuxuan Hu, Philip Li, Maxwell Yang, Daniel Kang · Mar 19, 2026 · Citations: 0
Expert Verification Automatic Metrics Multi Agent
Existing systems struggle with SODIUM tasks: we evaluate 6 advanced AI agents on SODIUM-Bench, with the strongest baseline achieving only 46.5% accuracy.
- PRBench: End-to-end Paper Reproduction in Physics Research
Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu · Mar 29, 2026 · Citations: 0
Rubric RatingExpert Verification Automatic MetricsSimulation Env
We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics.
- Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang · Mar 27, 2026 · Citations: 0
Rubric RatingExpert Verification Automatic Metrics
To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains.
- A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models
Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan · Apr 7, 2026 · Citations: 0
Expert Verification Automatic Metrics
Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.
- Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models
Mikko Saukkoriipi, Nicole Hernandez, Jaakko Sahlsten, Kimmo Kaski, Otso Arponen · Mar 27, 2026 · Citations: 0
Expert Verification Automatic Metrics
Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients.
- SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model
Guifeng Deng, Pan Wang, Jiquan Wang, Shuying Rao, Junyi Xie · Mar 22, 2026 · Citations: 0
Expert Verification Automatic Metrics
Expert evaluations further validated the quality of the model's reasoning, with mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence.
- RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari, Michael Gentile · Apr 2, 2026 · Citations: 0
Expert Verification Llm As JudgeAutomatic Metrics
This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic…
- ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory
Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang · Mar 27, 2026 · Citations: 0
Expert Verification Automatic Metrics Multi Agent
To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians.
- A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment
Sheng Liu, Long Chen, Zeyun Zhao, Qinglin Gou, Qingyue Wei · Mar 23, 2026 · Citations: 0
Expert Verification Automatic Metrics Multi Agent
We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis.
- Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning
Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao · Apr 7, 2026 · Citations: 0
Expert Verification Automatic Metrics
These models remarkably achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and surpass significantly…
- Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering
Elyas Irankhah, Samah Fodeh · Apr 8, 2026 · Citations: 0
Expert Verification Automatic Metrics
Third, results on the development set show that alignment accuracy is mainly limited by reasoning.
- Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy
Ruijie Yang, Yan Zhu, Peiyao Fu, Te Luo, Zhihua Wang · Apr 2, 2026 · Citations: 0
Expert Verification Automatic Metrics
Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic…
- Learning Diagnostic Reasoning for Decision Support in Toxicology
Nico Oberländer, David Bani-Harouni, Tobias Zellner, Nassir Navab, Florian Eyer · Mar 31, 2026 · Citations: 0
Expert Verification Automatic Metrics
To address this, we present DeToxR (Decision-support for Toxicology with Reasoning), the first adaptation of Reinforcement Learning (RL) to emergency toxicology.
- LLM-Powered Workflow Optimization for Multidisciplinary Software Development: An Automotive Industry Case Study
Shuai Wang, Yinan Yu, Earl Barr, Dhasarathy Parthasarathy · Mar 22, 2026 · Citations: 0
Expert Verification Automatic Metrics
We evaluate our approach on spapi, a production in-vehicle API system at Volvo Group involving 192 endpoints, 420 properties, and 776 CAN signals across six functional domains.