An Agentic System for Rare Disease Diagnosis with Traceable Reasoning

Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu, Yuze Sun, Xiao Zhou, Yanfeng Wang, Xin Sun, Ya Zhang, Yongguo Yu, Kun Sun, Weidi Xie · Jun 25, 2025 · Citations: 0

Tags: Automatic Metrics · Expert Verification · Medicine · Multi Agent

Abstract

Rare diseases affect over 300 million individuals worldwide, yet timely and accurate diagnosis remains an urgent challenge. Patients often endure a prolonged diagnostic odyssey exceeding five years, marked by repeated referrals, misdiagnoses, and unnecessary interventions, leading to delayed treatment and substantial emotional and economic burdens. Here we present DeepRare, a multi-agent system for rare disease differential diagnosis decision support powered by large language models, integrating over 40 specialized tools and up-to-date knowledge sources. DeepRare processes heterogeneous clinical inputs, including free-text descriptions, structured Human Phenotype Ontology terms, and genetic testing results, to generate ranked diagnostic hypotheses with transparent reasoning linked to verifiable medical evidence. Evaluated across nine datasets from literature, case reports and clinical centres across Asia, North America and Europe spanning 14 medical specialties, DeepRare demonstrates exceptional performance on 3,134 diseases. In human-phenotype-ontology-based tasks, it achieves an average Recall@1 of 57.18%, outperforming the next-best method by 23.79%; in multi-modal tests, it reaches 69.1% compared with Exomiser's 55.9% on 168 cases. Expert review achieved 95.4% agreement on its reasoning chains, confirming their validity and traceability. Our work not only advances rare disease diagnosis but also demonstrates how the latest powerful large-language-model-driven agentic systems can reshape current clinical workflows.

HFEPX Relevance Assessment

This paper carries direct human-feedback and evaluation-protocol signals and is likely useful for eval pipeline design.

Eval-Fit Score

75/100 • High

Use this as a primary source when designing or comparing eval protocols.

Human Feedback Signal

Detected

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

Human Data Lens

Uses human feedback: Yes
Feedback types: Expert Verification
Rater population: Domain Experts
Unit of annotation: Ranking
Expertise required: Medicine
Extraction source: Persisted extraction

Evaluation Lens

Evaluation modes: Automatic Metrics
Agentic eval: Multi Agent
Quality controls: Adjudication
Confidence: 0.80
Flags: None

Protocol And Measurement Signals

Benchmarks / Datasets

No benchmark or dataset names were extracted from the available abstract.

Reported Metrics

recall, agreement, recall@1

Research Brief

Deterministic synthesis

Here we present DeepRare, a multi-agent system for rare disease differential diagnosis decision support powered by large language models, integrating over 40 specialized tools and up-to-date knowledge sources. HFEPX signals include Expert Verification, Automatic Metrics, Multi Agent with confidence 0.80. Updated from current HFEPX corpus.

Generated Mar 3, 2026, 4:04 PM · Grounded in abstract + metadata only

Key Takeaways

Here we present DeepRare, a multi-agent system for rare disease differential diagnosis decision support powered by large language models, integrating over 40 specialized tools and…
DeepRare processes heterogeneous clinical inputs, including free-text descriptions, structured Human Phenotype Ontology terms, and genetic testing results, to generate ranked…

Researcher Actions

Compare its human-feedback setup against pairwise and rubric hubs.
Identify benchmark choices from full text before operationalizing conclusions.
Validate metric comparability (recall, agreement, recall@1).
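
One way to approach the metric-comparability check is sketched below in Python. It assumes the usual definition of Recall@k (the share of cases whose true diagnosis appears among the top k ranked hypotheses) and separates an absolute gap in percentage points from a relative gap. The function names and the toy cases are hypothetical and do not come from the paper; only the 69.1% and 55.9% figures are taken from the abstract. The abstract does not say which kind of gap its 23.79% figure refers to, so that should be confirmed in the full text.

```python
from typing import Sequence, Tuple


def recall_at_k(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int = 1) -> float:
    """Fraction of cases whose true diagnosis appears in the top-k hypotheses."""
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        raise ValueError("at least one case is required")
    hits = sum(1 for preds, gold in zip(ranked, truth) if gold in preds[:k])
    return hits / len(truth)


def gaps(score_a: float, score_b: float) -> Tuple[float, float]:
    """Return (absolute gap in percentage points, relative gap in percent)."""
    absolute = score_a - score_b
    relative = 100.0 * absolute / score_b
    return absolute, relative


# Hypothetical toy cases, for illustration only (not data from the paper).
ranked = [
    ["disease_a", "disease_b"],
    ["disease_c", "disease_d"],
    ["disease_e", "disease_f"],
]
truth = ["disease_a", "disease_d", "disease_x"]

r1 = recall_at_k(ranked, truth, k=1)  # only the first case is a top-1 hit
r2 = recall_at_k(ranked, truth, k=2)  # the second case is also a top-2 hit

# Multi-modal figures quoted in the abstract: 69.1% vs Exomiser's 55.9%.
abs_gap, rel_gap = gaps(69.1, 55.9)
print(f"Recall@1={r1:.3f} Recall@2={r2:.3f}")
print(f"absolute gap={abs_gap:.1f} points, relative gap={rel_gap:.1f}%")
```

Reporting both gaps side by side makes it harder to compare a percentage-point difference in one paper against a relative improvement in another.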

Caveats

Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
Extraction confidence is probabilistic and should be validated for critical decisions.

Recommended Queries

human-eval protocol design
agent eval benchmark comparison
adjudication reporting patterns

Research Summary

Contribution Summary

Here we present DeepRare, a multi-agent system for rare disease differential diagnosis decision support powered by large language models, integrating over 40 specialized tools and up-to-date knowledge sources.
DeepRare processes heterogeneous clinical inputs, including free-text descriptions, structured Human Phenotype Ontology terms, and genetic testing results, to generate ranked diagnostic hypotheses with transparent reasoning linked to…
In human-phenotype-ontology-based tasks, it achieves an average Recall@1 of 57.18%, outperforming the next-best method by 23.79%; in multi-modal tests, it reaches 69.1% compared with Exomiser's 55.9% on 168 cases.

Researcher Checklist

Pass: Human feedback protocol is explicit
Detected: Expert Verification

Pass: Evaluation mode is explicit
Detected: Automatic Metrics

Pass: Quality control reporting appears
Detected: Adjudication

Gap: Benchmark or dataset anchors are missing
No benchmark/dataset anchor extracted from abstract.

Pass: Metric reporting is present
Detected: recall, agreement, recall@1

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.

OMGs: A multi-agent system supporting MDT decision-making across the ovarian tumour care continuum
Protocol Overlap · Citations: 0 · Relevance: 10.50 · Shared tags: Expert Verification, Medicine, Multi Agent
- Shared 3 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup

TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
Protocol Overlap · Citations: 0 · Relevance: 10.50 · Shared tags: Expert Verification, Medicine, Multi Agent
- Shared 3 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup

EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis
Protocol Overlap · Citations: 0 · Relevance: 7.80 · Shared tags: Expert Verification, Multi Agent
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup

Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Protocol Overlap · Citations: 0 · Relevance: 7.80 · Shared tags: Expert Verification, Multi Agent
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup

SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
Protocol Overlap · Citations: 0 · Relevance: 7.80 · Shared tags: Expert Verification, Multi Agent
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup

Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Protocol Overlap · Citations: 0 · Relevance: 7.80 · Shared tags: Expert Verification, Multi Agent
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup

A Scalable Framework for Evaluating Health Language Models
Protocol Overlap · Citations: 0 · Relevance: 7.70 · Shared tags: Expert Verification, Medicine
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Shared metric mentions

An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Protocol Overlap · Citations: 0 · Relevance: 7.70 · Shared tags: Expert Verification, Medicine
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Shared metric mentions

CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Protocol Overlap · Citations: 0 · Relevance: 7.70 · Shared tags: Expert Verification, Medicine
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Shared metric mentions

Modeling Expert AI Diagnostic Alignment via Immutable Inference Snapshots
Protocol Overlap · Citations: 0 · Relevance: 7.70 · Shared tags: Expert Verification, Medicine
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Shared metric mentions

Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Protocol Overlap · Citations: 0 · Relevance: 7.70 · Shared tags: Expert Verification, Medicine
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Shared metric mentions

Diffusion Model in Latent Space for Medical Image Segmentation Task
Protocol Overlap · Citations: 0 · Relevance: 6.80 · Shared tags: Expert Verification, Medicine
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol