Skip to content
← Back to explorer

Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations

Zijie Liu, Xinyu Zhao, Jie Peng, Zhuangdi Zhu, Qingyu Chen, Kaidi Xu, Xia Hu, Tianlong Chen · Jan 29, 2025 · Citations: 0

Abstract

Current medical AI systems often fail to replicate real-world clinical reasoning, as they are predominantly trained and evaluated on static text and question-answer tasks. These tuning methods and benchmarks overlook critical aspects like evidence-based reasoning and handling distracting information. To bridge this gap, we introduce a novel benchmark that simulates real-world diagnostic scenarios, integrating noise and difficulty levels aligned with USMLE standards. Moreover, we explore dialogue-based fine-tuning, which transforms static datasets into conversational formats to better capture iterative reasoning processes. Experiments show that dialogue-tuned models outperform traditional methods, with improvements of $9.64\%$ in multi-round reasoning scenarios and $6.18\%$ in accuracy in a noisy environment. Our findings highlight dialogue tuning as a promising approach for advancing clinically aligned and robust medical AI systems.

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Unknown
  • Unit of annotation: Unknown
  • Expertise required: Medicine

Evaluation Lens

  • Evaluation modes: Automatic Metrics, Simulation Env
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.45
  • Flags: ambiguous

Research Summary

Contribution Summary

  • Current medical AI systems often fail to replicate real-world clinical reasoning, as they are predominantly trained and evaluated on static text and question-answer tasks.
  • These tuning methods and benchmarks overlook critical aspects like evidence-based reasoning and handling distracting information.
  • To bridge this gap, we introduce a novel benchmark that simulates real-world diagnostic scenarios, integrating noise and difficulty levels aligned with USMLE standards.

Why It Matters For Eval

  • These tuning methods and benchmarks overlook critical aspects like evidence-based reasoning and handling distracting information.
  • To bridge this gap, we introduce a novel benchmark that simulates real-world diagnostic scenarios, integrating noise and difficulty levels aligned with USMLE standards.

Related Papers