Tag: Medicine

Medicine evaluation papers that call for domain expertise or specialist review (115 papers).

Research Utility Snapshot

Evaluation Modes

Automatic Metrics (12)
LLM As Judge (1)
Simulation Env (1)

Human Feedback Types

Expert Verification (5)
Pairwise Preference (3)
Rubric Rating (3)

Required Expertise

Medicine (20)
Coding (1)
Multilingual (1)

OMGs: A multi-agent system supporting MDT decision-making across the ovarian tumour care continuum

Yangyang Zhang, Zilong Wang, Jianbo Xu, Yongqi Chen, Chu Han, Zhihao Zhang · Feb 14, 2026 · Citations: 0

Expert Verification Medicine

Here we present OMGs (Ovarian tumour Multidisciplinary intelligent aGent System), a multi-agent AI framework where domain-specific agents deliberate collaboratively to integrate multidisciplinary evidence and generate MDT-style…
To systematically evaluate MDT recommendation quality, we developed SPEAR (Safety, Personalization, Evidence, Actionability, Robustness) and validated OMGs across diverse clinical scenarios spanning the care continuum.

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang · Feb 13, 2026 · Citations: 0

Pairwise Preference Rubric Rating Medicine

MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities.
For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision…

RE-MCDF: Closed-Loop Multi-Expert LLM Reasoning for Knowledge-Grounded Clinical Diagnosis

Shaowei Shen, Xiaohong Yang, Jie Yang, Lianfen Huang, Yongcai Zhang, Yang Zou · Feb 1, 2026 · Citations: 0

Critique Edit Medicine

In such settings, single-agent systems are vulnerable to self-reinforcing errors, as their predictions lack independent validation and can drift toward spurious conclusions.
Although recent multi-agent frameworks attempt to mitigate this issue through collaborative reasoning, their interactions are often shallow and loosely structured, failing to reflect the rigorous, evidence-driven processes used by clinical…

INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection

Shubham Kulkarni, Alexander Lyzhov, Preetam Joshi, Shiva Chaitanya · Jan 28, 2026 · Citations: 0

Automatic Metrics Medicine

We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification.
All calls are annotated with a phase-structured JSON schema covering IVR navigation, patient identification, coverage status, medication checks (up to two drugs), and agent identification (CRN), and each phase is labeled for Information and…

PosIR: Position-Aware Heterogeneous Information Retrieval Benchmark

Ziyang Zeng, Dun Zhang, Yu Yan, Xu Sun, Cuiqiaoshu Pan, Yudong Zhou · Jan 13, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Medicine

To address these limitations, we introduce PosIR (Position-Aware Information Retrieval), the first standardized benchmark designed to systematically diagnose position bias in diverse retrieval scenarios.
Extensive experiments on 10 state-of-the-art embedding-based retrieval models reveal that: (1) retrieval performance on PosIR with documents exceeding 1536 tokens correlates poorly with the MMTEB benchmark, exposing limitations of current…

EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu, Ziyang Zhang · Jan 6, 2026 · Citations: 0

Automatic Metrics Medicine

Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference.
We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature.

Enhancing Moral Diagnosis and Correction in Large Language Models

Bocheng Chen, Xi Chen, Han Zi, Haitao Mao, Zimo Qi, Xitong Zhang · Jan 6, 2026 · Citations: 0

Red Team Medicine

Agentic Retoucher for Text-To-Image Generation

Shaocheng Shen, Jianfeng Liang, Chunlei Cai, Cong Geng, Huiyu Duan, Xiaoyun Zhang · Jan 5, 2026 · Citations: 0

Pairwise Preference Medicine

To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop.
Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text-image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via…

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

Junyu Liu, Zirui Li, Qian Niu, Zequn Zhang, Yue Xun, Wenlong Hou · Jan 4, 2026 · Citations: 0

Red Team Medicine Multilingual

To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating medical safety of LLMs for Japanese healthcare.
Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability.

From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark

Jinning Zhang, Jie Song, Wenhui Tu, Zecheng Li, Jingxuan Li, Jin Li · Jan 1, 2026 · Citations: 0

Rubric Rating Expert Verification Automatic Metrics Medicine

Validated in sports rehabilitation, we release a knowledge graph (357,844 nodes, 371,226 edges) and a benchmark of 1,637 QA pairs.
Five expert clinicians rated the system 4.66–4.84 on a 5-point Likert scale, and system rankings are preserved on a human-verified gold subset (n=80).

Reason2Decide: Rationale-Driven Multi-Task Learning

H M Quamran Hasan, Housam Khalifa Bashier, Jiayi Dai, Mi-Young Kim, Randy Goebel · Dec 23, 2025 · Citations: 0

LLM As Judge Automatic Metrics Medicine

Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge).
This indicates that LLM-generated rationales are suitable for pretraining models, reducing reliance on human annotations.

ClinicalTrialsHub: Bridging Registries and Literature for Comprehensive Clinical Trial Access

Jiwoo Park, Ruoqi Liu, Avani Jagdale, Andrew Srisuwananukorn, Jing Zhao, Lang Li · Dec 9, 2025 · Citations: 0

Expert Verification Medicine

We demonstrate its utility through a user study involving clinicians, clinical researchers, and PhD students of pharmaceutical sciences and nursing, and a systematic automatic evaluation of its information extraction and question answering…

Diffusion Model in Latent Space for Medical Image Segmentation Task

Huynh Trinh Ngoc, Toan Nguyen Hai, Ba Luong Son, Long Tran Quoc · Dec 1, 2025 · Citations: 0

Expert Verification Automatic Metrics Medicine Coding

RefTr: Recurrent Refinement of Confluent Trajectories for 3D Vascular Tree Centerlines

Roman Naeem, David Hagerman, Jennifer Alvén, Fredrik Kahl · Nov 25, 2025 · Citations: 0

Automatic Metrics Medicine

We further introduce an efficient non-maximum suppression algorithm for spatial tree graphs to merge duplicate branches and extend evaluation metrics to be radius-aware for robust comparison.

Human or LLM as Standardized Patients? A Comparative Study for Medical Education

Bingquan Zhang, Xiaoxiao Liu, Yuchi Wang, Lei Zhou, Qianqian Xie, Benyou Wang · Nov 12, 2025 · Citations: 0

Automatic Metrics Medicine

Although large language model (LLM)-based virtual standardized patients (VSPs) have been proposed as an alternative, their behavior remains unstable and lacks rigorous comparison with human standardized patients.
We propose EasyMED, a multi-agent VSP framework that separates case-grounded information disclosure from response generation to support stable, inquiry-conditioned patient behavior.

Dutch Metaphor Extraction from Cancer Patients' Interviews and Forum Data using LLMs and Human in the Loop

Lifeng Han, David Lindevelt, Sander Puts, Erik van Mulligen, Suzan Verberne · Nov 9, 2025 · Citations: 0

Expert Verification Medicine

With a human-in-the-loop setup, we verify the extracted metaphors and compile the outputs into a corpus named HealthQuote.NL.

Q²: Quantization-Aware Gradient Balancing and Attention Alignment for Low-Bit Quantization

Zhaoyang Wang, Dong Wang · Nov 8, 2025 · Citations: 0

Automatic Metrics Medicine

From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity

Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen, Haiyang Geng · Oct 29, 2025 · Citations: 0

Automatic Metrics Medicine

To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation.
Our multi-agent framework transfers the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards.

BIOGEN: Evidence-Grounded Multi-Agent Reasoning Framework for Transcriptomic Interpretation in Antimicrobial Resistance

Elias Hossain, Mehrdad Shoeibi, Ivan Garibay, Niloofar Yousefi · Oct 17, 2025 · Citations: 0

Automatic Metrics Medicine

We present BIOGEN, an evidence-grounded multi-agent framework for post hoc interpretation of RNA-seq transcriptional modules.
In comparisons with representative open-source agentic AI baselines, BIOGEN was the only framework that consistently preserved zero hallucination across all five datasets.

Augmenting Rating-Scale Measures with Text-Derived Items Using the Information-Determined Scoring (IDS) Framework

Joe Watson, Ivan O'Connor, Chia-Wen Chen, Luning Sun, Fang Luo, David Stillwell · Oct 9, 2025 · Citations: 0

Rubric Rating Automatic Metrics Simulation Env Medicine

This marks a conceptual departure from traditional automated text scoring by prioritising information gain over fidelity to expert rubrics or human-annotated data.
