Skip to content
← Back to explorer

Tag: Medicine

Filtered HFEPX paper feed.

Papers in tag: 115

Need Medicine evaluators for your project?

Post a Job →

Research Utility Snapshot

Evaluation Modes

  • Automatic Metrics (12)
  • Human Eval (1)
  • Llm As Judge (1)

Human Feedback Types

  • Expert Verification (11)
  • Pairwise Preference (4)
  • Rubric Rating (4)

Required Expertise

  • Medicine (20)
  • Coding (1)
  • Law (1)
Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models

Mikko Saukkoriipi, Nicole Hernandez, Jaakko Sahlsten, Kimmo Kaski, Otso Arponen · Mar 27, 2026 · Citations: 0

Expert Verification Automatic Metrics Medicine
  • Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients.
  • Clinical evaluation identified clinically significant errors in 2.9% of outputs, and semantically equivalent questions occasionally yielded discordant responses, including instances where one formulation was correct and the other contained…
Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation

Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen · Mar 27, 2026 · Citations: 0

Rubric RatingExpert Verification Automatic Metrics LawMedicine
  • To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains.
  • To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases.
ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory

Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang, Qing Li · Mar 27, 2026 · Citations: 0

Expert Verification Automatic Metrics Medicine
  • To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians.
  • Extensive experiments demonstrate that ClinicalAgents achieves state-of-the-art performance, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines.
A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu, Xi Wang · Mar 26, 2026 · Citations: 0

Expert Verification Human Eval Medicine
  • To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations.
  • To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties.
OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs

Suraj Racha, Prashant Harish Joshi, Utkarsh Maurya, Nitin Yadav, Mridul Sharma, Ananya Kunisetty · Mar 26, 2026 · Citations: 0

Rubric Rating Automatic Metrics Medicine
  • We highlight three primary challenges for LLMs in mental health - lack of high quality interpretable and knowledge grounded training data; training paradigms restricted to core capabilities, and evaluation of multi turn dialogue settings.
  • Addressing it, we present oMind framework which includes training and aligning LLM agents for diverse capabilities including conversations; high quality ~164k multi-task SFT dataset, as a result of our generation pipeline based on…
Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation

Lingzhe Zhang, Tong Jia, Mingyu Wang, Weijie Hong, Chiming Duan, Minghua He · Mar 23, 2026 · Citations: 0

Automatic Metrics Medicine
  • Large Language Models (LLM)-based Multi-Agent Systems (MASs) have emerged as a new paradigm in software system design, increasingly demonstrating strong reasoning and collaboration capabilities.
  • Building on this insight, we propose EAGER, an efficient failure management framework for multi-agent systems based on reasoning trace representation.
Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment

Mohamed Sobhi Jabal, Jikai Zhang, Dominic LaBella, Jessica L. Houk, Dylan Zhang, Jeffrey D. Rudie · Mar 23, 2026 · Citations: 0

Automatic Metrics Medicine
  • This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification.
  • The multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard compared to initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4…
SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model

Guifeng Deng, Pan Wang, Jiquan Wang, Shuying Rao, Junyi Xie, Wanjun Guo · Mar 22, 2026 · Citations: 0

Expert Verification Automatic Metrics Medicine
  • Expert evaluations further validated the quality of the model's reasoning, with mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence.
TARo: Token-level Adaptive Routing for LLM Test-time Alignment

Arushi Rai, Qiang Zhang, Hanqing Zeng, Yunkai Zhang, Dipesh Tamboli, Xiangjun Fan · Mar 19, 2026 · Citations: 0

Pairwise Preference MathMedicine
  • Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning.
  • Furthermore, TARo also generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.
FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data

Mitul Goswami, Romit Chatterjee, Arif Ahmed Sekh · Mar 16, 2026 · Citations: 0

Expert Verification Automatic Metrics Medicine
  • Post-mitigation evaluation on seven clinically distinct cohorts derived from the MIMIC-IV-ED and eICU databases demonstrates substantial bias reduction: Statistical Parity Difference decreases by 40 to 51 percent on MIMIC-IV-ED and 10 to 19…
MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering

Shaowei Guan, Yu Zhai, Hin Chi Kwok, Jiawei Du, Xinyu Feng, Jing Li · Mar 15, 2026 · Citations: 0

Automatic Metrics Medicine
  • To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering.
  • Through an extensive evaluation of 9 representative LLMs, we demonstrate a pervasive privacy-utility trade-off.
QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models

Yao Wu, Kangping Yin, Liang Dong, Zhenxin Ma, Shuting Xu, Xuehai Wang · Mar 14, 2026 · Citations: 0

Rubric Rating Automatic Metrics Medicine
  • To bridge this gap, we introduce QuarkMedBench, an ecologically valid benchmark tailored for real-world medical LLM assessment.
  • During evaluation, hierarchical weighting and safety constraints structurally quantify medical accuracy, key-point coverage, and risk interception, effectively mitigating the high costs and subjectivity of human grading.
Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese

Masataka Kawai, Singo Sakashita, Shumpei Ishikawa, Shogo Watanabe, Anna Matsuoka, Mikio Sakurai · Mar 12, 2026 · Citations: 0

Pairwise PreferenceExpert Verification Medicine
  • We evaluated seven open-source LLMs from three perspectives: (A) generation and information extraction of pathology diagnosis text following predefined formats, (B) correction of typographical errors in Japanese pathology reports, and (C)…
  • In contrast, preferences for explanatory outputs varied substantially across raters.
Stop Listening to Me! How Multi-turn Conversations Can Degrade LLM Diagnostic Reasoning

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong · Mar 12, 2026 · Citations: 0

Expert Verification Medicine
  • While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied.
  • Specifically, we develop a "stick-or-switch" evaluation framework to measure model conviction (i.e., defending a correct diagnosis or safe abstention against incorrect suggestions) and flexibility (i.e., recognizing a correct suggestion…
ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions

Monica Munnangi, Saiph Savage · Mar 11, 2026 · Citations: 0

Rubric Rating Llm As Judge Medicine
  • We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns.
  • We evaluate five state-of-the-art LLMs -- GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B -- on a stratified test split of 238 conversations (948 QA pairs) using a calibrated LLM-as-a-judge rubric grounded in physician…
On the Reliability of Cue Conflict and Beyond

Pum Jun Kim, Seung-Ah Lee, Seongho Park, Dongyoon Han, Jaejun Yoo · Mar 11, 2026 · Citations: 0

Pairwise Preference Medicine
  • Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes.
  • We introduce REFINED-BIAS, an integrated dataset and evaluation framework for reliable and interpretable shape-texture bias diagnosis.
Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.