- CounselReflect: A Toolkit for Auditing Mental-Health Dialogues
Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng · Mar 31, 2026 · Citations: 0
Rubric Rating Expert Verification Human Eval Web Browsing
The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined…
- PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Yiqing Zhang, Xiaozhong Liu, Fabricio Murai · Mar 28, 2026 · Citations: 0
Expert Verification Llm As Judge Automatic Metrics
In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
- Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning
Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner · Mar 29, 2026 · Citations: 0
Expert Verification Human Eval Automatic Metrics Multi Agent
In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded.
- SODIUM: From Open Web Data to Queryable Databases
Chuxuan Hu, Philip Li, Maxwell Yang, Daniel Kang · Mar 19, 2026 · Citations: 0
Expert Verification Automatic Metrics Multi Agent
Existing systems struggle with SODIUM tasks: we evaluate 6 advanced AI agents on SODIUM-Bench, with the strongest baseline achieving only 46.5% accuracy.
- PRBench: End-to-end Paper Reproduction in Physics Research
Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu · Mar 29, 2026 · Citations: 0
Rubric Rating Expert Verification Automatic Metrics Simulation Env
We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics.
- XpertBench: Expert Level Tasks with Rubrics-Based Evaluation
Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang · Mar 27, 2026 · Citations: 0
Rubric Rating Expert Verification Automatic Metrics
To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains.
- A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations
Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu · Mar 26, 2026 · Citations: 0
Expert Verification Human Eval
To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations.
- A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models
Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan · Apr 7, 2026 · Citations: 0
Expert Verification Automatic Metrics
Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.
- Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models
Mikko Saukkoriipi, Nicole Hernandez, Jaakko Sahlsten, Kimmo Kaski, Otso Arponen · Mar 27, 2026 · Citations: 0
Expert Verification Automatic Metrics
Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients.
- SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model
Guifeng Deng, Pan Wang, Jiquan Wang, Shuying Rao, Junyi Xie · Mar 22, 2026 · Citations: 0
Expert Verification Automatic Metrics
Expert evaluations further validated the quality of the model's reasoning, with mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence.
- RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari, Michael Gentile · Apr 2, 2026 · Citations: 0
Expert Verification Llm As Judge Automatic Metrics
This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic…
- ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory
Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang · Mar 27, 2026 · Citations: 0
Expert Verification Automatic Metrics Multi Agent
To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians.
- A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment
Sheng Liu, Long Chen, Zeyun Zhao, Qinglin Gou, Qingyue Wei · Mar 23, 2026 · Citations: 0
Expert Verification Automatic Metrics Multi Agent
We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis.
- EpiScreen: Early Epilepsy Detection from Electronic Health Records with Large Language Models
Shuang Zhou, Kai Yu, Zaifu Zhan, Huixue Zhou, Min Zeng · Mar 30, 2026 · Citations: 0
Expert Verification
In a clinician-AI collaboration setting, EpiScreen-assisted neurologists outperformed unaided experts by up to 10.9%.
- Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning
Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao · Apr 7, 2026 · Citations: 0
Expert Verification Automatic Metrics
These models remarkably achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and surpass significantly…
- Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering
Elyas Irankhah, Samah Fodeh · Apr 8, 2026 · Citations: 0
Expert Verification Automatic Metrics
Third, results on the development set show that alignment accuracy is mainly limited by reasoning.
- Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy
Ruijie Yang, Yan Zhu, Peiyao Fu, Te Luo, Zhihua Wang · Apr 2, 2026 · Citations: 0
Expert Verification Automatic Metrics
Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic…
- Learning Diagnostic Reasoning for Decision Support in Toxicology
Nico Oberländer, David Bani-Harouni, Tobias Zellner, Nassir Navab, Florian Eyer · Mar 31, 2026 · Citations: 0
Expert Verification Automatic Metrics
To address this, we present DeToxR (Decision-support for Toxicology with Reasoning), the first adaptation of Reinforcement Learning (RL) to emergency toxicology.
- LLM-Powered Workflow Optimization for Multidisciplinary Software Development: An Automotive Industry Case Study
Shuai Wang, Yinan Yu, Earl Barr, Dhasarathy Parthasarathy · Mar 22, 2026 · Citations: 0
Expert Verification Automatic Metrics
We evaluate our approach on spapi, a production in-vehicle API system at Volvo Group involving 192 endpoints, 420 properties, and 776 CAN signals across six functional domains.
- Calibrated Confidence Expression for Radiology Report Generation
David Bani-Harouni, Chantal Pellegrini, Julian Lüers, Su Hwan Kim, Markus Baalmann · Mar 31, 2026 · Citations: 0
Expert Verification
In a clinical evaluation we show that ConRad's report level scores are well aligned with clinicians' judgment.
- Generating and Evaluating Sustainable Procurement Criteria for the Swiss Public Sector using In-Context Prompting with Large Language Models
Yingqiang Gao, Veton Matoshi, Luca Rolshoven, Tilia Ellendorff, Judith Binder · Mar 23, 2026 · Citations: 0
Expert Verification
Swiss law requires the integration of ecological, social, and economic sustainability requirements into tender evaluations in the format of criteria that have to be fulfilled by a bidder.
- Training-Free Dynamic Upcycling of Expert Language Models
Eros Fanì, Oğuzhan Ersoy · Mar 31, 2026 · Citations: 0
Expert Verification
To address these issues, we introduce Dynamic Upcycling MoE (DUME), a novel approach that reuses dense experts trained on different domains to construct a unified MoE model.
- Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts
Haolei Xu, Haiwen Hong, Hongxing Li, Rui Zhou, Yang Zhang · Apr 9, 2026 · Citations: 0
Expert Verification
Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks.
- Selecting Decision-Relevant Concepts in Reinforcement Learning
Naveen Raman, Stephanie Milani, Fei Fang · Apr 6, 2026 · Citations: 0
Expert Verification
Training interpretable concept-based policies requires practitioners to manually select which human-understandable concepts an agent should reason with when making sequential decisions.
- FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models
Juyong Jiang, Fan Wang, Hong Qi, Sunghun Kim, Jing Tang · Apr 2, 2026 · Citations: 0
Expert Verification
Extensive evaluations across 28 benchmarks, multiple model architectures, and scales demonstrate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer…
- Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging
Mengxian Lyu, Cheng Peng, Ziyi Chen, Mengyuan Zhang, Jieting Li Lu · Apr 2, 2026 · Citations: 0
Expert Verification
Comprehensive evaluation across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization) shows that merged models can effectively mitigate catastrophic forgetting, preserve clinical domain…
- Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning
Mohammad R. Abu Ayyash · Apr 1, 2026 · Citations: 0
Expert Verification
We present Brainstacks, a modular architecture for continual multi-domain fine-tuning of large language models that packages domain expertise as frozen adapter stacks composing additively on a shared frozen base at inference.
- A Survey of On-Policy Distillation for Large Language Models
Mingyang Song, Mao Zheng · Apr 1, 2026 · Citations: 0
Expert Verification Demonstrations
We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.
- To Write or to Automate Linguistic Prompts, That Is the Question
Marina Sánchez-Torrón, Daria Akselrod, Jason Rauchwerk · Mar 26, 2026 · Citations: 0
Expert Verification
We present the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment, evaluating five model…