Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Decomposing Physician Disagreement in HealthBench

Satya Borgohain, Roy Mariathas · Feb 26, 2026

Citations: 0
Rubric Rating · Medicine
  • We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it.
  • The agreement ceiling in medical AI evaluation is thus largely structural, but the reducible/irreducible dissociation suggests that closing information gaps in evaluation scenarios could lower disagreement where inherent clinical ambiguity…
