
OpenTrain Research Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 24
MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation

Hsin-Ling Hsu, Cong-Tinh Dao, Luning Wang, Zitao Shuai, Thao Nguyen Minh Phan, Jun-En Ding · Mar 23, 2025

Citations: 0
Expert Verification Automatic Metrics Medicine
  • Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.
Can Multimodal LLMs Perform Time Series Anomaly Detection?

Xiongxiao Xu, Haoran Wang, Yueqing Liang, Philip S. Yu, Yue Zhao, Kai Shu · Feb 25, 2025

Citations: 0
Automatic Metrics Multi Agent Medicine
  • One natural way for humans to detect time series anomalies is through visualization and textual description.
  • To address the gap, we build VisualTimeAnomaly, a benchmark for comprehensively investigating the zero-shot capabilities of MLLMs for TSAD, progressing from point- to range- to variate-wise anomalies and extending to irregular sampling conditions.
Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare

Max Lamparth, Declan Grabb, Amy Franks, Scott Gershan, Kaitlyn N. Kunstman, Aaron Lulla · Feb 22, 2025

Citations: 0
Pairwise Preference Expert Verification Automatic Metrics Medicine Coding
  • Current medical language model (LM) benchmarks often oversimplify the complexities of day-to-day clinical practice and instead rely on evaluating LMs with multiple-choice board exam questions.
  • This design enables systematic evaluations of model performance and bias by studying how demographic factors affect decision-making.
Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations

Zijie Liu, Xinyu Zhao, Jie Peng, Zhuangdi Zhu, Qingyu Chen, Kaidi Xu · Jan 29, 2025

Citations: 0
Automatic Metrics Simulation Env Medicine
  • These tuning methods and benchmarks overlook critical aspects like evidence-based reasoning and handling distracting information.
  • To bridge this gap, we introduce a novel benchmark that simulates real-world diagnostic scenarios, integrating noise and difficulty levels aligned with USMLE standards.
