
Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 20


CounselReflect: A Toolkit for Auditing Mental-Health Dialogues

Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng, Angel Hsing-Chi Hwang · Mar 31, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Expert Verification).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Tags: Rubric Rating · Expert Verification · Human Eval · Web Browsing · Coding
  • The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined…
  • Human evaluation includes a user study with 20 participants and an expert review with 6 mental-health professionals, suggesting that CounselReflect supports understandable, usable, and trustworthy auditing.
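To picture the two-family design in the first bullet, here is a minimal, hypothetical sketch of a metric registry that combines model-based predictors with rubric-derived metrics. All names are illustrative assumptions, not the toolkit's actual interfaces.

```python
# Hypothetical sketch of a two-family metric registry like the one the
# CounselReflect summary describes. All names are illustrative, not
# the toolkit's actual API.
from typing import Callable, Dict

# Family (i): model-based metrics, each backed by a task-specific predictor.
model_based: Dict[str, Callable[[str], float]] = {
    "empathy": lambda dialogue: 0.0,           # placeholder predictor
    "reflective_listening": lambda dialogue: 0.0,
}

# Family (ii): rubric-based metrics drawn from a literature-derived library,
# extensible with user-defined rubrics scored by an LLM or a human rater.
rubric_based: Dict[str, str] = {
    "open_questions": "Does the counselor ask open-ended questions?",
}

def audit(dialogue: str) -> Dict[str, float]:
    """Run every registered metric over one dialogue (rubrics stubbed here)."""
    scores = {name: fn(dialogue) for name, fn in model_based.items()}
    scores.update({name: float("nan") for name in rubric_based})  # needs a rater
    return scores
```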
Open paper
PRBench: End-to-end Paper Reproduction in Physics Research

Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li · Mar 29, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Expert Verification).

Score: 65% · Moderate protocol signal · Freshness: Hot · Status: Ready
Tags: Rubric Rating · Expert Verification · Automatic Metrics · Simulation Env · Coding
  • We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics.
  • Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution.
Open paper
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models

Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng, Kyle Lam · Feb 25, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Expert Verification).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Tags: Expert Verification · Automatic Metrics · Medicine · Coding
  • Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
  • We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to 7 distinct visual clinical evidence (CE) types per case.
Open paper
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video

Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao, Yibing Fu · Feb 25, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Expert Verification).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Tags: Expert Verification · Automatic Metrics · Medicine · Coding
  • Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
  • We introduce ResGo, a benchmark of laparoscopic frames annotated with Go Zone bounding boxes and clinician-authored rationales covering phase, exposure quality reasoning, next action and risk reminder.
Open paper
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery

David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Expert Verification).

Score: 58% · High protocol signal · Freshness: Warm · Status: Ready
Tags: Expert Verification · Automatic Metrics · Multi Agent · Coding
  • Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
  • The code, datasets, and evaluation protocols for SparkMe are available as open-source at https://github.com/SALT-NLP/SparkMe.
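The deliberative planning described in the first bullet can be pictured in a few lines: score each candidate question by its mean utility over simulated conversation rollouts, then ask the best one. In the sketch below, simulate_rollout and score_insight are placeholder stubs, not SparkMe's actual API; the real implementation is in the linked repository.

```python
# Hypothetical sketch of rollout-based question selection as described in the
# SparkMe summary above. Function names are illustrative placeholders.
import random
from statistics import mean

def simulate_rollout(question: str, persona: str) -> str:
    """Stand-in for an LLM-simulated participant answering `question`."""
    return f"[{persona}] simulated answer to: {question}"

def score_insight(transcript: str) -> float:
    """Stand-in for a learned or LLM-based utility score of one rollout."""
    return random.random()

def select_question(candidates: list[str], personas: list[str], k: int = 4) -> str:
    """Pick the candidate with the highest mean utility over k rollouts."""
    def expected_utility(q: str) -> float:
        rollouts = [simulate_rollout(q, random.choice(personas)) for _ in range(k)]
        return mean(score_insight(t) for t in rollouts)
    return max(candidates, key=expected_utility)

print(select_question(
    ["What surprised you most?", "Walk me through a typical day."],
    personas=["early adopter", "skeptic"],
))
```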
Open paper
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents

Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu, Liqun Liu · Feb 15, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Expert Verification).

Score: 58% · High protocol signal · Freshness: Warm · Status: Ready
Tags: Expert Verification · Simulation Env · Long Horizon · Coding
  • While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.
  • To address this gap, we propose AD-Bench, a benchmark designed based on real-world business requirements of advertising and marketing platforms.
Open paper
IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR

Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun, Prayag Tiwari · Jan 23, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Expert Verification).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Tags: Pairwise Preference · Expert Verification · Human Eval · Coding
  • Peer review relies on substantive, evidence-based questions, yet current LLMs generate surface-level queries that perform worse than human reviewer questions in expert evaluation.
  • To address this gap, we curate a high-quality dataset of reviewer questions from OpenReview and conduct a human preference study where expert annotators evaluate question-paper pairs across three dimensions: effort, evidence, and grounding.
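As a rough illustration of the study protocol, one annotation record might look like the sketch below; the three rating fields mirror the dimensions named above, while everything else is assumed.

```python
# Hypothetical annotation record for the three-dimension preference study
# described above (effort, evidence, grounding). Schema is assumed, not
# taken from the paper.
from dataclasses import dataclass

@dataclass
class QuestionJudgment:
    paper_id: str
    question: str
    effort: int     # e.g. 1-5 Likert rating
    evidence: int   # e.g. 1-5 Likert rating
    grounding: int  # e.g. 1-5 Likert rating
    preferred_over_human: bool  # pairwise preference vs. a human reviewer question
```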
Open paper
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri, Trizal Garg · Dec 26, 2025

Citations: 0

Match reason: Matches selected tags (Coding, Expert Verification).

Score: 58% · High protocol signal · Freshness: Warm · Status: Ready
Tags: Expert Verification · Automatic Metrics · Coding · Multilingual
  • To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
  • We evaluate six state-of-the-art models, including GPT-4o, Claude 3.7 Sonnet, and open-source models, using a strict evaluation protocol.
Open paper
Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration

Yinghao Tang, Yupeng Xie, Yingchaojie Feng, Tingfeng Lan, Wei Chen · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Expert Verification).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Ready
Tags: Expert Verification · Multi Agent · Coding
  • Recent LLM-based agents can automate content creation, but naively applying them yields uncontrollable and unverifiable outputs.
  • We present ViviDoc, a human-agent collaborative system that generates interactive educational documents from a single topic input.
Open paper
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai · Feb 2, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Expert Verification).

Score: 58% · High protocol signal · Freshness: Warm · Status: Fallback
Tags: Expert Verification · Automatic Metrics · Web Browsing · Coding
  • However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations.
  • First, existing benchmarks are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from the prior world knowledge in current MLLMs.
Open paper
Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation

Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu, Weiwen Liu · Oct 28, 2025

Citations: 0

Match reason: Matches selected tags (Coding, Expert Verification).

Score: 53% · Moderate protocol signal · Freshness: Cold · Status: Ready
Tags: Expert Verification · LLM-as-Judge · Automatic Metrics · Coding
  • To address these challenges, we propose an agent-driven benchmark construction pipeline that leverages human supervision to efficiently generate diverse project-level tasks.
  • Furthermore, to overcome the inaccuracy of general LLM judges, we propose a highly reliable evaluation framework powered by a specialized, fine-tuned model.
Open paper

Match reason: Matches selected tags (Coding, Expert Verification).

Score: 53% · Moderate protocol signal · Freshness: Cold · Status: Ready
Tags: Expert Verification · Automatic Metrics · Coding
  • However, having AI models generate full reviews in the same way as human reviewers risks exacerbating the irresponsible use of LLM-generated reviews and instigating intentional manipulation.
  • We introduce several baseline approaches and an extendable automatic evaluation framework using top reasoning LLMs as judges to tackle the difficulty of recruiting domain experts for manual evaluation.
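A minimal sketch of what one LLM-as-judge check in such a framework could look like, assuming a 1-10 faithfulness rubric; call_llm is a placeholder for your own client, and none of this is the paper's actual protocol.

```python
# Hypothetical LLM-as-judge call in the spirit of the framework described
# above. The rubric and output format are assumptions.
JUDGE_PROMPT = """You are reviewing an AI-generated peer review.
Paper abstract:
{abstract}

Generated review:
{review}

Rate the review's faithfulness to the paper on a 1-10 scale.
Answer with only the integer."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def judge(abstract: str, review: str) -> int:
    reply = call_llm(JUDGE_PROMPT.format(abstract=abstract, review=review))
    return int(reply.strip())
```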
Open paper

Match reason: Matches selected tags (Coding, Expert Verification).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Tags: Pairwise Preference · Expert Verification · Medicine · Coding
  • To avoid costly clinician labeling, we introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations.
  • We evaluate PrivMedChat across medical dialogue tasks and assess utility, safety, and privacy under consistent privacy accounting, thereby providing a practical pathway to align medical chatbots while offering formal privacy guarantees.
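The annotation-free pairing in the first bullet reduces to a simple construction: physician responses become "chosen" and filtered non-expert generations become "rejected". The sketch below assumes this framing; the field names and the quality filter are invented placeholders, not the paper's criteria.

```python
# Hypothetical sketch of annotation-free preference pair construction as
# described above. Field names and the filter are illustrative.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # physician-written response
    rejected: str  # filtered non-expert generation

def passes_filter(text: str) -> bool:
    """Stand-in quality filter; the paper's actual criteria are not shown here."""
    return len(text.split()) > 10

def build_pairs(dialogues):
    pairs = []
    for d in dialogues:  # each: {"prompt", "physician", "model_generation"}
        if passes_filter(d["model_generation"]):
            pairs.append(PreferencePair(d["prompt"], d["physician"], d["model_generation"]))
    return pairs
```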
Open paper
ExpGuard: LLM Content Moderation in Specialized Domains

Minseok Choi, Dongjin Kim, Seungbin Yang, Subin Kim, Youngjun Kwak, Juyoung Oh · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Expert Verification).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Tags: Expert Verification · Law · Medicine
  • With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies.
  • Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial…
Open paper
FHIRPath-QA: Executable Question Answering over FHIR Electronic Health Records

Michael Frew, Nishit Bheda, Bryan Tripp · Feb 26, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Expert Verification).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Tags: Expert Verification · Medicine · Coding
  • In this work, we introduce FHIRPath-QA, the first open dataset and benchmark for patient-specific QA that includes open-standard FHIRPath queries over real-world clinical data.
  • Our results highlight that text-to-FHIRPath synthesis has the potential to serve as a practical foundation for safe, efficient, and interoperable consumer health applications, and our dataset and benchmark serve as a starting point for…
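For readers unfamiliar with FHIRPath, here is an invented example of the kind of (question, query) pair the benchmark describes; it is not drawn from the dataset.

```python
# Illustrative only: the kind of (question, FHIRPath) pair the FHIRPath-QA
# summary describes. This example is invented, not taken from the dataset.
example = {
    "question": "What is the patient's first given name?",
    "fhirpath": "Patient.name.first().given.first()",
}
# An agent would execute the FHIRPath expression against the patient's FHIR
# record and return the resulting value as the answer.
```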
Open paper
