Skip to content
← Back to explorer

Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions

Hanjie Chen, Zhouxiang Fang, Yash Singla, Mark Dredze · Feb 28, 2024 · Citations: 0

Abstract

LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations. However, medical board exams or general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning of model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets. Datasets and code are available at https://github.com/HanjieChen/ChallengeClinicalQA. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks, accompanied by expert-written explanations. We evaluate seven LLMs on the two datasets using various prompts. Experiments demonstrate that our datasets are harder than previous benchmarks. In-depth automatic and human evaluations of model-generated explanations provide insights into the promise and deficiency of LLMs for explainable medical QA.

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Domain Experts
  • Unit of annotation: Unknown
  • Expertise required: Medicine, Coding

Evaluation Lens

  • Evaluation modes: Human Eval
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.30
  • Flags: low_signal, possible_false_positive

Research Summary

Contribution Summary

  • LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations.
  • However, medical board exams or general clinical questions do not capture the complexity of realistic clinical cases.
  • Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning of model decisions, a crucial component of supporting doctors in making complex medical decisions.

Why It Matters For Eval

  • Experiments demonstrate that our datasets are harder than previous benchmarks.
  • In-depth automatic and human evaluations of model-generated explanations provide insights into the promise and deficiency of LLMs for explainable medical QA.

Related Papers