Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 20

Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

José Pombal, Ricardo Rei, André F. T. Martins · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference, Rubric Rating, LLM as Judge, Medicine
  • We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings.
  • Using IFEval, a benchmark with programmatically verifiable rubrics, we show that SPB persists even when evaluation criteria are entirely objective: among rubrics where generators fail, judges can be up to 50% more likely to incorrectly…
Open paper
Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference, LLM as Judge, Automatic Metrics, Medicine, Multilingual
  • A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
  • Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%).
Open paper
DongYuan: An LLM-Based Framework for Integrative Chinese and Western Medicine Spleen-Stomach Disorders Diagnosis

Hua Li, Yingying Li, Xiaobin Feng, Xinyi Fu, Lifeng Dong, Qingfeng Yang · Mar 30, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference, Web Browsing, Medicine
  • While large language models (LLMs) offer new potential for medical applications, they face three major challenges in the context of integrative Chinese and Western medicine (ICWM): a lack of high-quality data, the absence of models capable…
  • …supervised fine-tuning (SFT) and direct preference optimization (DPO), and complemented it with SSDF-Navigator, a pluggable consultation navigation model designed to optimize clinical inquiry strategies.
Open paper
Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference, Rubric Rating, Automatic Metrics, Medicine
  • We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses.
  • Across 7 benchmarks and 5 task models, PEEM's accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate Spearman rho about 0.97, Pearson r about 0.94, p < 0.001).
Open paper
CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

Mohammed Baharoon, Thibault Heintz, Siavash Raissi, Mahmoud Alabbad, Mona Alhammad, Hassan AlOmaish · Mar 6, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference, Automatic Metrics, Medicine
  • We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety.
  • CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's tau = 0.61-0.71; Pearson's r = 0.71-0.84), and through two additional benchmarks that we…
Open paper
Think²: Grounded Metacognitive Reasoning in Large Language Models

Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference, Human Eval, Math, Medicine
  • We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight…
  • Blinded human evaluations over 580 query pairs show an 84% aggregate preference for trustworthiness and metacognitive self-awareness over standard and Chain-of-Thought baselines.
Open paper
Multi-Objective Alignment of Language Models for Personalized Psychotherapy

Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli, Majid Sarrafzadeh · Feb 17, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference, Expert Verification, Automatic Metrics, Medicine
  • While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.
  • We survey 335 individuals with lived mental health experience to collect preference rankings across therapeutic dimensions, then develop a multi-objective alignment framework using direct preference optimization.
Open paper
PosIR: Position-Aware Heterogeneous Information Retrieval Benchmark

Ziyang Zeng, Dun Zhang, Yu Yan, Xu Sun, Cuiqiaoshu Pan, Yudong Zhou · Jan 13, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference, Automatic Metrics, Medicine
  • To address these limitations, we introduce PosIR (Position-Aware Information Retrieval), the first standardized benchmark designed to systematically diagnose position bias in diverse retrieval scenarios.
  • Extensive experiments on 10 state-of-the-art embedding-based retrieval models reveal that: (1) retrieval performance on PosIR with documents exceeding 1536 tokens correlates poorly with the MMTEB benchmark, exposing limitations of current…
Open paper
VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization

Weixin Liu, Congning Ni, Qingyuan Song, Susannah L. Rose, Christopher Symons, Murat Kantarcioglu · Mar 11, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference, LLM as Judge, Medicine
  • We introduce VERI-DPO, which uses claim verification to mine preferences and distill them into the summarizer with Direct Preference Optimization (DPO).
  • On held-out patients, verifier-mined preferences separate candidates by contradiction density, and VERI-DPO reduces Not Supported claim rates from 10.7% to 1.9% (local verifier judge) and from 11.6% to 6.4% (GPT-4o judge), while improving…
Open paper
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang · Feb 13, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference, Rubric Rating, Long Horizon, Medicine
  • MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities.
  • For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision…
Open paper
TARo: Token-level Adaptive Routing for LLM Test-time Alignment

Arushi Rai, Qiang Zhang, Hanqing Zeng, Yunkai Zhang, Dipesh Tamboli, Xiangjun Fan · Mar 19, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback
Pairwise Preference, Math, Medicine
  • Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning.
  • Furthermore, TARo also generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.
Open paper
OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation

Zhuoxiao Chen, Hongyang Yu, Ying Xu, Yadan Luo, Long Duong, Yuan-Fang Li · Sep 23, 2025

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 53% Moderate protocol signal Freshness: Cold Status: Ready
Pairwise Preference, Automatic Metrics, Medicine
  • OraPO enables single-stage, RL-only training by converting failed GRPO explorations on rare or difficult studies into direct preference supervision via a lightweight oracle step.
Open paper
Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare

Max Lamparth, Declan Grabb, Amy Franks, Scott Gershan, Kaitlyn N. Kunstman, Aaron Lulla · Feb 22, 2025

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 53% Moderate protocol signal Freshness: Cold Status: Ready
Pairwise Preference, Expert Verification, Automatic Metrics, Medicine
  • Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions.
  • We outline a series of intended use cases and demonstrate the usability of our dataset by evaluating sixteen off-the-shelf and six (mental) health fine-tuned LMs on category-specific task accuracy, on the fairness impact of patient…
Open paper

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference, Medicine
  • Applied to 102 handbooks from 23 centers and 1,115 benchmark questions, the framework quantifies heterogeneity across four dimensions: question, topic, organ, and center.
Open paper
Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese

Masataka Kawai, Singo Sakashita, Shumpei Ishikawa, Shogo Watanabe, Anna Matsuoka, Mikio Sakurai · Mar 12, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference, Expert Verification, Medicine
  • We evaluated seven open-source LLMs from three perspectives: (A) generation and information extraction of pathology diagnosis text following predefined formats, (B) correction of typographical errors in Japanese pathology reports, and (C)…
  • In contrast, preferences for explanatory outputs varied substantially across raters.
Open paper
On the Reliability of Cue Conflict and Beyond

Pum Jun Kim, Seung-Ah Lee, Seongho Park, Dongyoon Han, Jaejun Yoo · Mar 11, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference, Medicine
  • Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes.
  • We introduce REFINED-BIAS, an integrated dataset and evaluation framework for reliable and interpretable shape-texture bias diagnosis.
Open paper

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference, Expert Verification, Medicine, Coding
  • To avoid costly clinician labeling, we introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations.
  • We evaluate PrivMedChat across medical dialogue tasks and assess utility, safety, and privacy under consistent privacy accounting, thereby providing a practical pathway to align medical chatbots while offering formal privacy guarantees.
Open paper
Cold-Start Personalization via Training-Free Priors from Structured World Models

Avinandan Bose, Shuyue Stella Li, Faeze Brahman, Pang Wei Koh, Simon Shaolei Du, Yulia Tsvetkov · Feb 16, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference, Math, Medicine
  • Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available.
  • Across medical, mathematical, social, and commonsense reasoning, Pep achieves 80.8% alignment between generated responses and users' stated preferences versus 68.5% for RL, with 3-5x fewer interactions.
Open paper
Agentic Retoucher for Text-To-Image Generation

Shaocheng Shen, Jianfeng Liang, Chunlei Cai, Cong Geng, Huiyu Duan, Xiaoyun Zhang · Jan 5, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference, Medicine
  • To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop.
  • Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text-image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via…
Open paper

Match reason: Matches selected tags (Medicine, Pairwise Preference).

Score: 46% Sparse protocol signal Freshness: Cold Status: Fallback
Pairwise Preference, Medicine
  • To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from multimodal biomedical data through preference…
  • We demonstrate its effectiveness through two key applications: (1) Rare genetic disease prediction from texts, where MINT uses a multimodal encoder model, trained on facial photos and clinical notes, to generate a preference dataset for…
Open paper
