Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 54 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,634) General (532) Long Horizon (320) Pairwise Preference (289) Coding (221) Simulation Env (190) Multi Agent (184) Medicine (117) Llm As Judge (109) Expert Verification (98) Human Eval (89) Rubric Rating (82) Web Browsing (79) Math (78) Demonstrations (67) Critique Edit (63)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Apr 16, 2026 · Citations: 0

We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Latent-Condensed Transformer for Efficient Long Context Modeling
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models
Apr 13, 2026 · Citations: 0

Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
Apr 13, 2026 · Citations: 0

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
Apr 13, 2026 · Citations: 0

Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies.
Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation
Apr 13, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Apr 13, 2026 · Citations: 0

We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry
Apr 11, 2026 · Citations: 0

To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human-written poems and 20,388 poems are generated by four popular LLMs.
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
Apr 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
Apr 11, 2026 · Citations: 0

To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression--ranking objective to provide fine-grained evaluation of reasoning paths.
MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits
Apr 10, 2026 · Citations: 0

On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering

Elyas Irankhah, Samah Fodeh · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready

Expert Verification Automatic Metrics Medicine

Open paper

A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan, Elliot M. Fielstein · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% High protocol signal Freshness: Hot Status: Ready

Expert Verification Automatic Metrics MedicineMultilingual

Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.
Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria.

Open paper

Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy

Ruijie Yang, Yan Zhu, Peiyao Fu, Te Luo, Zhihua Wang, Xian Yang · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready

Expert Verification Automatic Metrics Medicine

Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic…
In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC)…

Open paper

Learning Diagnostic Reasoning for Decision Support in Toxicology

Nico Oberländer, David Bani-Harouni, Tobias Zellner, Nassir Navab, Florian Eyer, Matthias Keicher · Mar 31, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready

Expert Verification Automatic Metrics Medicine

Open paper

Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning

Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner, Hongyuan Mei · Mar 29, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% High protocol signal Freshness: Hot Status: Ready

Expert Verification Human EvalAutomatic Metrics Multi Agent Medicine

In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded.
Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases.

Open paper

PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering

Yiqing Zhang, Xiaozhong Liu, Fabricio Murai · Mar 28, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% High protocol signal Freshness: Hot Status: Ready

Expert Verification Llm As JudgeAutomatic Metrics Medicine

In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
PubMed Reasoner with a GPT-4o backbone achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, and showing consistent gains on MMLU Clinical Knowledge.

Open paper

Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models

Mikko Saukkoriipi, Nicole Hernandez, Jaakko Sahlsten, Kimmo Kaski, Otso Arponen · Mar 27, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% High protocol signal Freshness: Hot Status: Ready

Expert Verification Automatic Metrics Medicine

Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients.
Clinical evaluation identified clinically significant errors in 2.9% of outputs, and semantically equivalent questions occasionally yielded discordant responses, including instances where one formulation was correct and the other contained…

Open paper

Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation

Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen · Mar 27, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% High protocol signal Freshness: Hot Status: Ready

Rubric RatingExpert Verification Automatic Metrics LawMedicine

To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains.
To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases.

Open paper

ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory

Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang, Qing Li · Mar 27, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% High protocol signal Freshness: Hot Status: Ready

Expert Verification Automatic Metrics Multi Agent Medicine

To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians.
Extensive experiments demonstrate that ClinicalAgents achieves state-of-the-art performance, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines.

Open paper

A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu, Xi Wang · Mar 26, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready

Expert Verification Human Eval Medicine

To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations.
To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties.

Open paper

A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment

Sheng Liu, Long Chen, Zeyun Zhao, Qinglin Gou, Qingyue Wei, Arjun Masurkar · Mar 23, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% High protocol signal Freshness: Hot Status: Ready

Expert Verification Automatic Metrics Multi Agent Medicine

We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis.

Open paper

SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model

Guifeng Deng, Pan Wang, Jiquan Wang, Shuying Rao, Junyi Xie, Wanjun Guo · Mar 22, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% High protocol signal Freshness: Hot Status: Ready

Expert Verification Automatic Metrics Medicine

Expert evaluations further validated the quality of the model's reasoning, with mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence.

Open paper

EpiScreen: Early Epilepsy Detection from Electronic Health Records with Large Language Models

Shuang Zhou, Kai Yu, Zaifu Zhan, Huixue Zhou, Min Zeng, Feng Xie · Mar 30, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback

Expert Verification Medicine

Open paper

Calibrated Confidence Expression for Radiology Report Generation

David Bani-Harouni, Chantal Pellegrini, Julian Lüers, Su Hwan Kim, Markus Baalmann, Benedikt Wiestler · Mar 31, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 62% Moderate protocol signal Freshness: Hot Status: Fallback

Expert Verification Medicine

In a clinical evaluation we show that ConRad's report level scores are well aligned with clinicians' judgment.

Open paper

FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data

Mitul Goswami, Romit Chatterjee, Arif Ahmed Sekh · Mar 16, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 58% High protocol signal Freshness: Warm Status: Ready

Expert Verification Automatic Metrics Medicine

Post-mitigation evaluation on seven clinically distinct cohorts derived from the MIMIC-IV-ED and eICU databases demonstrates substantial bias reduction: Statistical Parity Difference decreases by 40 to 51 percent on MIMIC-IV-ED and 10 to 19…

Open paper

Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering

Lin Fan, Yafei Ou, Zhipeng Deng, Pengyu Dai, Hou Chongxian, Jiale Yan · Mar 14, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 58% High protocol signal Freshness: Warm Status: Ready

Expert Verification Automatic Metrics Long Horizon Medicine

Benchmark: github.com/hahaha111111/Step-CoT.

Open paper

Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging

Mengxian Lyu, Cheng Peng, Ziyi Chen, Mengyuan Zhang, Jieting Li Lu, Yonghui Wu · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback

Expert Verification Medicine

Comprehensive evaluation across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization) shows that merged models can effectively mitigate catastrophic forgetting, preserve clinical domain…

Open paper

Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning

Mohammad R. Abu Ayyash · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback

Expert Verification MathMedicine

Open paper

sebis at ArchEHR-QA 2026: How Much Can You Do Locally? Evaluating Grounded EHR QA on a Single Notebook

Ibrahim Ebrar Yurt, Fabian Karl, Tejaswi Choppa, Florian Matthes · Mar 14, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Expert Verification MedicineCoding

Open paper

Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese

Masataka Kawai, Singo Sakashita, Shumpei Ishikawa, Shogo Watanabe, Anna Matsuoka, Mikio Sakurai · Mar 12, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Expert Verification).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Pairwise PreferenceExpert Verification Medicine

We evaluated seven open-source LLMs from three perspectives: (A) generation and information extraction of pathology diagnosis text following predefined formats, (B) correction of typographical errors in Japanese pathology reports, and (C)…
In contrast, preferences for explanatory outputs varied substantially across raters.

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent