Skip to content
← Back to explorer

A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives

Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He, Xiaolei Huang · Jun 19, 2025 · Citations: 0

Abstract

Synthetic data generation using large language models (LLMs) demonstrates substantial promise in addressing biomedical data challenges and shows increasing adoption in biomedical research. This study systematically reviews recent advances in synthetic data generation for biomedical applications and clinical research, focusing on how LLMs address data scarcity, utility, and quality issues with different modalities. We conducted a scoping review following PRISMA-ScR guidelines and searched literature published between 2020 and 2025 through PubMed, ACM, Web of Science, and Google Scholar. A total of 59 studies were included based on relevance to synthetic data generation in biomedical contexts. Among the reviewed studies, the predominant data modalities were unstructured texts (78.0\%), tabular data (13.6\%), and multimodal sources (8.4\%). Common generation methods included LLM prompting (74.6\%), fine-tuning (20.3\%), and specialized models (5.1\%). Evaluations were heterogeneous: intrinsic metrics (27.1\%), human-in-the-loop assessments (44.1\%), and LLM-based evaluations (13.6\%). However, limitations and key barriers persist in data modalities, domain utility, resource and model accessibility, and standardized evaluation protocols. Future efforts may focus on developing standardized, transparent evaluation frameworks and expanding accessibility to support effective applications in biomedical research.

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Unknown
  • Unit of annotation: Unknown
  • Expertise required: Medicine

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.30
  • Flags: low_signal, possible_false_positive

Research Summary

Contribution Summary

  • Synthetic data generation using large language models (LLMs) demonstrates substantial promise in addressing biomedical data challenges and shows increasing adoption in biomedical research.
  • This study systematically reviews recent advances in synthetic data generation for biomedical applications and clinical research, focusing on how LLMs address data scarcity, utility, and quality issues with different modalities.
  • We conducted a scoping review following PRISMA-ScR guidelines and searched literature published between 2020 and 2025 through PubMed, ACM, Web of Science, and Google Scholar.

Why It Matters For Eval

  • Evaluations were heterogeneous: intrinsic metrics (27.1\%), human-in-the-loop assessments (44.1\%), and LLM-based evaluations (13.6\%).
  • However, limitations and key barriers persist in data modalities, domain utility, resource and model accessibility, and standardized evaluation protocols.

Related Papers