
PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems

Sudip Bhujel · Mar 3, 2026 · Citations: 0

Abstract

Large language models are increasingly used for patient-facing medical assistance and clinical decision support, but adapting them to clinical dialogue often requires supervision derived from doctor-patient conversations that may contain sensitive information. Conventional supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content. We present PrivMedChat, an end-to-end framework for differentially private RLHF (DP-RLHF) for medical dialogue. Our design enforces differential privacy at every training stage that directly accesses dialogue-derived supervision: (i) Differentially Private Stochastic Gradient Descent (DP-SGD) for medical SFT and (ii) DP-SGD for reward model learning from preference pairs. To limit additional privacy expenditure during alignment, we apply DP-SGD to the PPO actor and critic when operating on dialogue-derived prompts, while the reward model remains fixed after DP training. We also introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations to produce scalable preference data without clinician labeling. Experiments on medical dialogue benchmarks show that PrivMedChat at $\varepsilon=7$ achieves the highest ROUGE-L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score of 2.86 in a 3-model LLM-jury evaluation, while producing membership-inference signals that are near chance (AUC 0.510-0.555). We open-source our code at https://github.com/sudip-bhujel/privmedchat.
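
For orientation, DP-SGD, the primitive the abstract applies at the SFT and reward-model stages, clips each example's gradient to a fixed L2 norm and adds calibrated Gaussian noise before the optimizer step. Below is a minimal PyTorch sketch under assumed hyperparameters, not the paper's implementation; a real run would also track the cumulative $(\varepsilon, \delta)$ budget with a privacy accountant such as Opacus.

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, optimizer,
                clip_norm=1.0, noise_multiplier=1.1):
    # Illustrative DP-SGD step: per-example gradient clipping plus Gaussian
    # noise. clip_norm and noise_multiplier are placeholder values.
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(batch_x, batch_y):  # compute per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        # Clip the whole per-example gradient to L2 norm <= clip_norm.
        total_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in params))
        scale = min(1.0, clip_norm / (float(total_norm) + 1e-12))
        for s, p in zip(summed, params):
            s.add_(p.grad, alpha=scale)

    model.zero_grad()
    for s, p in zip(summed, params):
        noise = torch.normal(0.0, noise_multiplier * clip_norm, size=s.shape)
        p.grad = (s + noise) / len(batch_x)  # noisy mean gradient
    optimizer.step()
```

The annotation-free preference construction can be sketched the same way: each physician reply becomes the "chosen" response and a filtered non-expert generation becomes the "rejected" one. `generate_non_expert` and `passes_filter` below are hypothetical stand-ins; the abstract does not specify the actual generator or filtering criteria.

```python
def build_preference_pairs(dialogues, generate_non_expert, passes_filter):
    # Hypothetical sketch: `generate_non_expert` and `passes_filter` stand in
    # for whatever generator and quality filter the paper actually uses.
    pairs = []
    for d in dialogues:
        candidate = generate_non_expert(d["prompt"])
        if passes_filter(candidate):  # drop degenerate or unsafe generations
            pairs.append({"prompt": d["prompt"],
                          "chosen": d["physician_response"],
                          "rejected": candidate})
    return pairs
```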

HFEPX Relevance Assessment

This paper has direct human-feedback and/or evaluation protocol signal and is likely useful for eval pipeline design.

Eval-Fit Score

65/100 • Medium

Useful as a secondary reference; validate protocol details against neighboring papers.

Human Feedback Signal

Detected

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Pairwise Preference, Expert Verification
  • Rater population: Mixed
  • Unit of annotation: Unknown
  • Expertise required: Medicine, Coding
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.70
  • Flags: None

Protocol And Measurement Signals

Benchmarks / Datasets

No benchmark or dataset names were extracted from the available abstract.

Reported Metrics

ROUGE (reported as ROUGE-L in the abstract)

Research Brief

Deterministic synthesis

Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content. HFEPX signals include Pairwise Preference, Expert Verification, and Automatic Metrics, with extraction confidence 0.70. Updated from the current HFEPX corpus.

Generated Mar 4, 2026, 4:38 AM · Grounded in abstract + metadata only

Key Takeaways

  • Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content.
  • We present PrivMedChat, an end-to-end framework for differentially private RLHF (DP-RLHF) for medical dialogue.

Researcher Actions

  • Compare its human-feedback setup against pairwise and rubric hubs.
  • Identify benchmark choices from full text before operationalizing conclusions.
  • Validate metric comparability (ROUGE-L; a minimal check is sketched after this list).
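
ROUGE numbers are only comparable when the variant (ROUGE-L vs. ROUGE-Lsum), tokenization, and stemming match, and none of these are stated in the abstract. A minimal check, assuming Google's `rouge-score` package as the reference implementation:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# ROUGE-L with stemming enabled; both choices are assumptions here and
# must be matched to the paper's configuration before comparing scores.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
result = scorer.score(
    target="Take the medication with food and monitor for stomach pain.",
    prediction="Take the medication with food; watch for stomach pain.",
)
print(result["rougeL"].fmeasure)  # the F-measure is what papers usually report
```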

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content.
  • We present PrivMedChat, an end-to-end framework for differentially private RLHF (DP-RLHF) for medical dialogue.
  • Experiments on medical dialogue benchmarks show that PrivMedChat at $\varepsilon=7$ achieves the highest ROUGE-L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score of 2.86 in a 3-model LLM-jury evaluation.

Why It Matters For Eval

  • Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content.
  • Experiments on medical dialogue benchmarks show that PrivMedChat at $\varepsilon=7$ achieves the highest ROUGE-L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score of 2.86 in a 3-model LLM-jury evaluation, while producing membership-inference signals that are near chance (AUC 0.510-0.555); a minimal loss-based AUC check is sketched after this list.
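
To make "near chance" concrete: a common loss-thresholding membership-inference baseline (not necessarily the attack the paper uses) scores each example by its negative loss and computes the AUC separating training members from held-out non-members. A sketch with scikit-learn and made-up losses:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical per-example losses; memorization would show up as
# systematically lower loss on training members.
member_losses = np.array([0.92, 1.05, 0.88, 1.01])     # training examples
nonmember_losses = np.array([0.95, 1.10, 0.90, 1.03])  # held-out examples

labels = np.concatenate([np.ones(len(member_losses)),
                         np.zeros(len(nonmember_losses))])
scores = -np.concatenate([member_losses, nonmember_losses])
print(roc_auc_score(labels, scores))  # ~0.5 means near chance, as reported
```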

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Pairwise Preference, Expert Verification

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Gap: Quality control reporting is missing

    No calibration/adjudication/IAA control explicitly detected.

  • Gap: Benchmark or dataset anchors are missing

    No benchmark/dataset anchor extracted from abstract.

  • Pass: Metric reporting is present

    Detected: ROUGE-L

