Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 74 · Search mode: keyword

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan, Elliot M. Fielstein · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Automatic Metrics · Medicine · Multilingual
  • Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.
  • Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria.
Open paper
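The F1 figure above depends on how matches are counted. As a reminder for triage, span-level F1 reduces to precision and recall over matched spans; a minimal sketch (the counts and the relaxed-matching remark are illustrative assumptions, not taken from the paper):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Under a relaxed criterion, a partially overlapping extraction might count
# as a true positive; under exact matching it would count as both a false
# positive and a false negative, lowering both precision and recall.
```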
Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · Moderate protocol signal · Freshness: Hot · Status: Ready
Pairwise Preference · Llm As Judge · Automatic Metrics · Medicine · Multilingual
  • A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
  • Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%).
Open paper

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · Moderate protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Automatic Metrics · Medicine
  • Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic…
  • In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC)…
Open paper
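Character error rate, as reported above, is conventionally the character-level Levenshtein edit distance normalized by reference length; a minimal sketch of that generic definition (not the paper's specific implementation):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Dynamic-programming table for character-level edit distance.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # cost of deleting i reference characters
    for j in range(n + 1):
        d[0][j] = j  # cost of inserting j hypothesis characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / m if m else 0.0
```

Note that CER can exceed 1.0 when the hypothesis is much longer than the reference, which is why it is reported as a rate rather than an accuracy.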
Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · Moderate protocol signal · Freshness: Hot · Status: Ready
Critique Edit · Automatic Metrics · Medicine
  • Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile,…
  • In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track…
Open paper
Learning Diagnostic Reasoning for Decision Support in Toxicology

Nico Oberländer, David Bani-Harouni, Tobias Zellner, Nassir Navab, Florian Eyer, Matthias Keicher · Mar 31, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · Moderate protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Automatic Metrics · Medicine
Open paper
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning

Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner, Hongyuan Mei · Mar 29, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Human Eval · Automatic Metrics · Multi Agent · Medicine
  • In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded.
  • Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases.
Open paper
Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Llm As Judge · Automatic Metrics · Medicine
  • In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
  • PubMed Reasoner with a GPT-4o backbone achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, and showing consistent gains on MMLU Clinical Knowledge.
Open paper
Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models

Mikko Saukkoriipi, Nicole Hernandez, Jaakko Sahlsten, Kimmo Kaski, Otso Arponen · Mar 27, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Automatic Metrics · Medicine
  • Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients.
  • Clinical evaluation identified clinically significant errors in 2.9% of outputs, and semantically equivalent questions occasionally yielded discordant responses, including instances where one formulation was correct and the other contained…
Open paper
XpertBench: Expert-Level Tasks with Rubrics-Based Evaluation

Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen · Mar 27, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Rubric Rating · Expert Verification · Automatic Metrics · Law · Medicine
  • To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains.
  • To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases.
Open paper
ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory

Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang, Qing Li · Mar 27, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Automatic Metrics · Multi Agent · Medicine
  • To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians.
  • Extensive experiments demonstrate that ClinicalAgents achieves state-of-the-art performance, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines.
Open paper
OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs

Suraj Racha, Prashant Harish Joshi, Utkarsh Maurya, Nitin Yadav, Mridul Sharma, Ananya Kunisetty · Mar 26, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · Moderate protocol signal · Freshness: Hot · Status: Ready
Rubric Rating · Automatic Metrics · Medicine
  • We highlight three primary challenges for LLMs in mental health: a lack of high-quality, interpretable, knowledge-grounded training data; training paradigms restricted to core capabilities; and evaluation of multi-turn dialogue settings.
  • Addressing these, we present the oMind framework, which includes training and aligning LLM agents for diverse capabilities, including conversation, and a high-quality ~164k multi-task SFT dataset produced by our generation pipeline based on…
Open paper
A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment

Sheng Liu, Long Chen, Zeyun Zhao, Qinglin Gou, Qingyue Wei, Arjun Masurkar · Mar 23, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Automatic Metrics · Multi Agent · Medicine
  • We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis.
Open paper
SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model

Guifeng Deng, Pan Wang, Jiquan Wang, Shuying Rao, Junyi Xie, Wanjun Guo · Mar 22, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Automatic Metrics · Medicine
  • Expert evaluations further validated the quality of the model's reasoning, with mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence.
Open paper
Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Automatic Metrics · Long Horizon · Medicine
  • We propose SEA, a self-learning diagnostic agent with a cognitively inspired dual-memory module.
  • On standard evaluation with MedCaseReasoning dataset, SEA achieves 92.46% accuracy, outperforming the strongest baseline by +19.6%, demonstrating the benefit of jointly optimizing reasoning and memory.
Open paper
HippoCamp: Benchmarking Contextual Agents on Personal Computers

Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu, Hoang-Nhat Nguyen, Yichi Zhang · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Automatic Metrics · Tool Use · Medicine
  • We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management.
  • We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp.
Open paper
Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation

Lingzhe Zhang, Tong Jia, Mingyu Wang, Weijie Hong, Chiming Duan, Minghua He · Mar 23, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Automatic Metrics · Multi Agent · Medicine
  • Large Language Model (LLM)-based Multi-Agent Systems (MASs) have emerged as a new paradigm in software system design, increasingly demonstrating strong reasoning and collaboration capabilities.
  • Building on this insight, we propose EAGER, an efficient failure management framework for multi-agent systems based on reasoning trace representation.
Open paper
Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment

Mohamed Sobhi Jabal, Jikai Zhang, Dominic LaBella, Jessica L. Houk, Dylan Zhang, Jeffrey D. Rudie · Mar 23, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 65% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Automatic Metrics · Multi Agent · Medicine
  • This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification.
  • The multi-agent LLM system achieved higher BT-RADS classification agreement with the expert reference standard than initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4…
Open paper
Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization

Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon, Md Rakibul Hasan · Mar 31, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 62% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Human Eval · Automatic Metrics · Medicine
  • Through controlled ablations on MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance.
  • ViTAS achieves SOTA results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores.
Open paper

Match reason: Matches selected tags (Automatic Metrics, Medicine).

Score: 62% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Llm As Judge · Automatic Metrics · Medicine
  • Fine-tuning Qwen2.5-7B-Instruct on CrossTrace via QLoRA yields substantial improvements over the untuned baseline: IAScore rises from 0.828 to 0.968 (GPT-4o judge) and from 0.716 to 0.888 (Claude Opus 4.5), structural compliance improves…
  • Human validation of 150 stratified records confirms 99.7% step-level grounding accuracy and a 0.0% fabrication rate.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.