Skin-R1: Clinical Knowledge-Guided Dermatological Diagnosis Using Vision-Language Models
Zehao Liu, Weijieying Ren, Jipeng Zhang, Tianxiang Zhao, Jingxi Zhu, Xiaoting Li, Vasant G Honavar · Nov 18, 2025 · Citations: 0
How to use this page
Low trust: Use this as background context only. Do not make protocol decisions from this page alone.
Best use
Background context only
What to verify
Validate the evaluation procedure and quality controls in the full paper before operational use.
Evidence quality
Low
Derived from extracted protocol signals and abstract evidence.
Abstract
Vision-language models (VLMs) have recently shown promise for assisting clinical reasoning in dermatological diagnosis. However, their trustworthiness and clinical utility remain limited by three key challenges: heterogeneous datasets with inconsistent diagnostic labels and concept annotations, the lack of grounded diagnostic rationales for reliable reasoning supervision, and limited scalability when transferring knowledge from small, densely annotated datasets to large collections with sparse labels. To address these challenges, we propose Skin-R1, a dermatology-oriented VLM that integrates textbook-grounded clinical reasoning supervision with reinforcement learning (RL) to improve the accuracy and robustness of diagnostic prediction. First, we construct a textbook-based reasoning generator that synthesizes hierarchy-aware and differential-diagnosis (DDx) diagnostic trajectories derived from authoritative dermatology knowledge. Second, these trajectories are used for supervised fine-tuning (SFT), establishing a clinically grounded reasoning foundation for the model. Finally, we introduce an RL training framework that incorporates the hierarchical structure of dermatological diseases into the reward design, enabling the model to generalize grounded diagnostic reasoning to large-scale datasets with sparse annotations. Extensive experiments across multiple dermatology benchmarks demonstrate that Skin-R1 consistently improves diagnostic accuracy and robustness compared to state-of-the-art Med-VLM baselines. Ablation studies further highlight the critical role of grounded reasoning supervision introduced during the SFT stage.
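The abstract does not specify how the disease hierarchy enters the RL reward, so the following is only an illustrative sketch of one way a hierarchy-aware reward could be defined, not the authors' formulation. It assumes a hypothetical toy taxonomy (`PARENT`) and a hypothetical function `hierarchical_reward` that gives partial credit when the predicted diagnosis shares ancestors with the ground-truth label.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (child -> parent); None marks the root.
# The labels and their grouping are illustrative assumptions only.
PARENT: Dict[str, Optional[str]] = {
    "skin disease": None,
    "inflammatory": "skin disease",
    "neoplastic": "skin disease",
    "psoriasis": "inflammatory",
    "atopic dermatitis": "inflammatory",
    "melanoma": "neoplastic",
    "basal cell carcinoma": "neoplastic",
}


def path_from_root(label: str) -> List[str]:
    """Return the list of nodes from the root down to `label`."""
    path: List[str] = []
    node: Optional[str] = label
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


def hierarchical_reward(predicted: str, target: str) -> float:
    """Assumed reward: fraction of non-root taxonomy levels on which the
    predicted and target paths agree, in [0, 1]."""
    if predicted not in PARENT or target not in PARENT:
        return 0.0  # labels outside the taxonomy earn nothing
    pred_path = path_from_root(predicted)
    target_path = path_from_root(target)

    # Count the shared prefix of the two root-to-label paths.
    shared = 0
    for p, t in zip(pred_path, target_path):
        if p != t:
            break
        shared += 1

    # Normalize by the longer path, excluding the root that all labels share.
    depth = max(len(pred_path), len(target_path)) - 1
    if depth == 0:
        return 1.0  # both labels are the root itself
    return (shared - 1) / depth


if __name__ == "__main__":
    for pred in ["melanoma", "basal cell carcinoma", "psoriasis"]:
        print(pred, hierarchical_reward(pred, "melanoma"))
```

Under this assumed scheme, a prediction in the correct disease family but with the wrong leaf label earns partial reward, while a prediction in a different family earns none. A graded signal of this kind is one plausible way to train on sparsely annotated data where only coarse labels are available.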