
Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 54

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.


Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.


Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.


A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan, Elliot M. Fielstein · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Automatic Metrics · Medicine · Multilingual
  • Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.
  • Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria.
Open paper
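The "relaxed matching" F1 reported above admits partial matches between extracted and reference spans. The paper's exact relaxed criterion is not given here; the sketch below assumes case-insensitive substring overlap as a stand-in, with the standard F1 = 2PR/(P+R) on top:

```python
def relaxed_match(pred: str, gold: str) -> bool:
    # Assumed relaxed criterion: case-insensitive substring overlap.
    p, g = pred.strip().lower(), gold.strip().lower()
    return bool(p) and bool(g) and (p in g or g in p)

def relaxed_f1(preds: list[str], golds: list[str]) -> float:
    # Precision counts predictions matching some gold span;
    # recall counts gold spans matched by some prediction.
    tp_p = sum(any(relaxed_match(p, g) for g in golds) for p in preds)
    tp_g = sum(any(relaxed_match(p, g) for p in preds) for g in golds)
    precision = tp_p / len(preds) if preds else 0.0
    recall = tp_g / len(golds) if golds else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under exact matching, "diabetes mellitus" vs. the reference "diabetes" would count as an error; under this relaxed criterion it counts as a true positive, which is why relaxed F1 upper-bounds strict F1.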

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% · Moderate protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Automatic Metrics · Medicine
  • Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic…
  • In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC)…
Open paper
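Character error rate (CER), the headline metric in the entry above, is the standard ASR measure: Levenshtein edit distance between hypothesis and reference transcripts, divided by the reference length. A self-contained sketch of the standard definition (not EndoASR's own implementation):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit_distance(ref, hyp) / len(ref)."""
    # Row-by-row dynamic-programming Levenshtein distance.
    prev = list(range(len(hypothesis) + 1))
    for i, rc in enumerate(reference, start=1):
        curr = [i]
        for j, hc in enumerate(hypothesis, start=1):
            cost = 0 if rc == hc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(reference) if reference else 0.0
```

Note that CER can exceed 1.0 when the hypothesis contains many insertions, so a drop from 20.52% to 14.14% is a relative error reduction of roughly 31%.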
Learning Diagnostic Reasoning for Decision Support in Toxicology

Nico Oberländer, David Bani-Harouni, Tobias Zellner, Nassir Navab, Florian Eyer, Matthias Keicher · Mar 31, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% · Moderate protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Automatic Metrics · Medicine
Open paper
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning

Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner, Hongyuan Mei · Mar 29, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Human Eval · Automatic Metrics · Multi Agent · Medicine
  • In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded.
  • Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases.
Open paper
Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Llm As Judge · Automatic Metrics · Medicine
  • In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
  • PubMed Reasoner with a GPT-4o backbone achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, and showing consistent gains on MMLU Clinical Knowledge.
Open paper
Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models

Mikko Saukkoriipi, Nicole Hernandez, Jaakko Sahlsten, Kimmo Kaski, Otso Arponen · Mar 27, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Automatic Metrics · Medicine
  • Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients.
  • Clinical evaluation identified clinically significant errors in 2.9% of outputs, and semantically equivalent questions occasionally yielded discordant responses, including instances where one formulation was correct and the other contained…
Open paper
XpertBench: Expert-Level Tasks with Rubrics-Based Evaluation

Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen · Mar 27, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Rubric Rating · Expert Verification · Automatic Metrics · Law · Medicine
  • To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains.
  • To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases.
Open paper
ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory

Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang, Qing Li · Mar 27, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Automatic Metrics · Multi Agent · Medicine
  • To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians.
  • Extensive experiments demonstrate that ClinicalAgents achieves state-of-the-art performance, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines.
Open paper
Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% · Moderate protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Human Eval · Medicine
  • To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations.
  • To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties.
Open paper
A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment

Sheng Liu, Long Chen, Zeyun Zhao, Qinglin Gou, Qingyue Wei, Arjun Masurkar · Mar 23, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Automatic Metrics · Multi Agent · Medicine
  • We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis.
Open paper
SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model

Guifeng Deng, Pan Wang, Jiquan Wang, Shuying Rao, Junyi Xie, Wanjun Guo · Mar 22, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Automatic Metrics · Medicine
  • Expert evaluations further validated the quality of the model's reasoning, with mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence.
Open paper
Calibrated Confidence Expression for Radiology Report Generation

David Bani-Harouni, Chantal Pellegrini, Julian Lüers, Su Hwan Kim, Markus Baalmann, Benedikt Wiestler · Mar 31, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 62% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Expert Verification · Medicine
  • In a clinical evaluation we show that ConRad's report level scores are well aligned with clinicians' judgment.
Open paper

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 58% · High protocol signal · Freshness: Warm · Status: Ready
Expert Verification · Automatic Metrics · Medicine
  • Post-mitigation evaluation on seven clinically distinct cohorts derived from the MIMIC-IV-ED and eICU databases demonstrates substantial bias reduction: Statistical Parity Difference decreases by 40 to 51 percent on MIMIC-IV-ED and 10 to 19…
Open paper
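Statistical Parity Difference (SPD), the fairness metric reported above, is the gap in positive-prediction rates between two groups: SPD = P(ŷ=1 | A=a) − P(ŷ=1 | A=b), with 0 indicating parity. A minimal sketch of the standard definition (group labels and encoding are illustrative, not the paper's):

```python
def statistical_parity_difference(y_pred, groups, group_a, group_b):
    """SPD = P(y_hat = 1 | group_a) - P(y_hat = 1 | group_b)."""
    def positive_rate(g):
        # Positive-prediction rate among members of group g.
        members = [y for y, grp in zip(y_pred, groups) if grp == g]
        return sum(members) / len(members) if members else 0.0
    return positive_rate(group_a) - positive_rate(group_b)
```

A "40 to 51 percent decrease" in SPD thus means the between-group gap in positive predictions shrank by that fraction after mitigation, not that it reached zero.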
Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging

Mengxian Lyu, Cheng Peng, Ziyi Chen, Mengyuan Zhang, Jieting Li Lu, Yonghui Wu · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 58% · Sparse protocol signal · Freshness: Hot · Status: Fallback
Expert Verification · Medicine
  • Comprehensive evaluation across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization) shows that merged models can effectively mitigate catastrophic forgetting, preserve clinical domain…
Open paper
Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese

Masataka Kawai, Singo Sakashita, Shumpei Ishikawa, Shogo Watanabe, Anna Matsuoka, Mikio Sakurai · Mar 12, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Pairwise Preference · Expert Verification · Medicine
  • We evaluated seven open-source LLMs from three perspectives: (A) generation and information extraction of pathology diagnosis text following predefined formats, (B) correction of typographical errors in Japanese pathology reports, and (C)…
  • In contrast, preferences for explanatory outputs varied substantially across raters.
Open paper
