
Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 89 · Search mode: keyword

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Pairwise Preference · Rubric Rating · Human Eval · Automatic Metrics · General
  • Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
  • To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences.
Open paper
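Benchmarks of this kind are typically scored as pairwise preference accuracy: the reward model should assign the human-preferred response a higher score than the rejected one. A minimal sketch of that scoring loop follows; the score_fn interface and the toy data are illustrative assumptions, not the Personalized RewardBench protocol.

    # Pairwise preference accuracy for a reward model (illustrative sketch).
    # score_fn(prompt, response) -> float stands in for any reward model;
    # the data format is a toy assumption, not the benchmark's actual schema.
    def preference_accuracy(score_fn, pairs):
        """pairs: iterable of (prompt, chosen_response, rejected_response)."""
        correct = total = 0
        for prompt, chosen, rejected in pairs:
            total += 1
            if score_fn(prompt, chosen) > score_fn(prompt, rejected):
                correct += 1
        return correct / max(total, 1)

    # Toy usage: a dummy "reward model" that prefers longer responses.
    dummy_rm = lambda prompt, response: float(len(response))
    pairs = [("Summarize the report.", "A concise two-sentence summary.", "Ok.")]
    print(preference_accuracy(dummy_rm, pairs))  # 1.0 on this toy pair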
LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias

Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska · Mar 31, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Rubric Rating · Human Eval · General
  • We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring.
  • Our results show that strong open-weight models achieve moderate to high agreement with humans on holistic scoring (Quadratic Weighted Kappa about 0.6), but this does not transfer uniformly to analytic scoring.
Open paper
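For quick context on the essay-scoring entry above, Quadratic Weighted Kappa penalizes rater disagreements by the squared distance between score categories, so a value around 0.6 indicates substantial but imperfect agreement. A minimal sketch using scikit-learn (an assumed tool; the paper's own scoring code is not shown here):

    # Quadratic Weighted Kappa between human and model essay scores.
    # The score lists are toy data; scikit-learn is an assumed dependency.
    from sklearn.metrics import cohen_kappa_score

    human_scores = [1, 2, 3, 4, 4, 5, 2, 3]
    model_scores = [1, 2, 4, 4, 3, 5, 2, 2]

    qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
    print(f"QWK = {qwk:.3f}")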
CounselReflect: A Toolkit for Auditing Mental-Health Dialogues

Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng, Angel Hsing-Chi Hwang · Mar 31, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Rubric Rating · Expert Verification · Human Eval · Web Browsing · Coding
  • The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined…
  • Human evaluation includes a user study with 20 participants and an expert review with 6 mental-health professionals, suggesting that CounselReflect supports understandable, usable, and trustworthy auditing.
Open paper
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning

Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner, Hongyuan Mei · Mar 29, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Expert Verification · Human Eval · Automatic Metrics · Multi Agent · Medicine
  • In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded.
  • Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases.
Open paper

Match reason: Matches selected tags (Human Eval).

Score: 62% · Moderate protocol signal · Freshness: Hot · Status: Ready
Human Eval · General
  • These systems commonly achieve performance levels comparable to or better than trained human raters, but have frequently been shown to be vulnerable to construct-irrelevant factors (i.e., features of responses that…
Open paper
Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 58% · Sparse protocol signal · Freshness: Hot · Status: Ready
Human Eval · General
  • We propose an agentic evaluation architecture comprising a multimodal screening agent, a heterogeneous jury of five evaluative agents, and a meta-agent for verdict synthesis and human escalation.
  • In an empirical study on Romanian upper-secondary history textbooks, 83.3% of 270 screened excerpts were classified as pedagogically acceptable (mean severity 2.9/7), versus 5.4/7 under a zero-shot baseline, demonstrating that agentic…
Open paper
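The jury-plus-meta-agent design described above can be read as score aggregation with an escalation rule: accept when the jury's mean severity is low, flag when it is high, and hand off to a human when the jurors disagree too much. The sketch below is a simplified stand-in; the thresholds and the disagreement rule are assumptions, not the paper's verdict-synthesis logic.

    # Simplified stand-in for a jury of evaluative agents plus a meta-agent.
    # Severity uses the paper's 1-7 scale; the aggregation thresholds are assumed.
    from statistics import mean, pstdev

    def synthesize_verdict(jury_severities, accept_below=4.0, disagreement_cutoff=1.5):
        """jury_severities: severity ratings (1-7) from the evaluative agents."""
        avg, spread = mean(jury_severities), pstdev(jury_severities)
        if spread > disagreement_cutoff:
            return {"verdict": "escalate_to_human", "mean_severity": avg, "spread": spread}
        verdict = "acceptable" if avg < accept_below else "flagged"
        return {"verdict": verdict, "mean_severity": avg, "spread": spread}

    print(synthesize_verdict([2, 3, 3, 2, 4]))  # acceptable
    print(synthesize_verdict([1, 6, 2, 7, 3]))  # escalate_to_human (high disagreement)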
STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems

Hongru Ji, Yuyin Fan, Meng Zhao, Xianghua Li, Lianwei Wu, Chao Gao · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 58% · Sparse protocol signal · Freshness: Hot · Status: Ready
Human Eval · General
  • To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with…
  • Extensive experiments demonstrate that STRIDE-ED generalizes across diverse open-source LLMs and consistently outperforms existing methods on both automatic metrics and human evaluations.
Open paper
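The "multi-model consistency-weighted evaluation" step above can be read as weighting each candidate annotation by how strongly independent annotator models agree on it, then down-weighting or dropping low-consistency samples. A rough sketch under that reading (the majority-vote weighting and label names are assumptions, not the published STRIDE-ED pipeline):

    # Consistency-weighted label selection across several LLM annotators (assumed reading).
    from collections import Counter

    def consistency_weighted_label(annotations):
        """annotations: strategy labels proposed by different annotator models."""
        label, votes = Counter(annotations).most_common(1)[0]
        return label, votes / len(annotations)   # agreement ratio in [0, 1]

    label, weight = consistency_weighted_label(["reflection", "reflection", "question"])
    print(label, round(weight, 2))  # reflection 0.67
    # Samples with low agreement could be down-weighted or excluded from training.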
Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 58% · Sparse protocol signal · Freshness: Hot · Status: Ready
Human Eval · Coding
  • Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations.
Open paper

Match reason: Matches selected tags (Human Eval).

Score: 58% · Sparse protocol signal · Freshness: Hot · Status: Ready
Human Eval · Law
  • LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed?
  • Through 960 sessions with two model pairs across 15 tasks, we show that persona-based agent judges produce evaluations indistinguishable from human raters in a Turing-style validation.
Open paper
ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection

Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga · Mar 31, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 58% · Sparse protocol signal · Freshness: Hot · Status: Ready
Human Eval · General
  • Through component analysis, human evaluation, and error analysis, we further examine when and why the retrieved context contributes to more reliable verifiability judgments.
Open paper
Open Machine Translation for Esperanto

Ona de Gibert, Lluís de Gibert · Mar 31, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 58% · Sparse protocol signal · Freshness: Hot · Status: Ready
Human Eval · Coding · Multilingual
  • In this work, we present the first comprehensive evaluation of open-source MT systems for Esperanto, comparing rule-based systems, encoder-decoder models, and LLMs across model sizes.
  • We evaluate translation quality across six language directions involving English, Spanish, Catalan, and Esperanto using multiple automatic metrics as well as human evaluation.
Open paper
DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling

Shicheng Liu, Yucheng Jiang, Sajid Farook, Camila Nicollier Sanchez, David Fernando Castro Pena, Monica S. Lam · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 65% · High protocol signal · Freshness: Hot · Status: Fallback
Human Eval · Long Horizon · General
  • Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis.
  • In this paper, we present DataSTORM, an LLM-based agentic system capable of autonomously conducting research across both large-scale structured databases and internet sources.
Open paper
Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework

Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala, Jouni Isoaho · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 65% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Human Eval · Automatic Metrics · General
  • However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation.
  • Human evaluation with strong inter-rater agreement (Cohen's κ > 0.80) confirms robustness.
Open paper
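For reference on the robustness claim above, Cohen's kappa compares observed rater agreement against chance-level agreement, and values above 0.80 are conventionally read as near-perfect agreement:

    \kappa = \frac{p_o - p_e}{1 - p_e}

where p_o is the observed proportion of agreement between raters and p_e is the agreement expected by chance from the raters' marginal label distributions.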
Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 62% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Human Eval · Automatic Metrics · Math
  • Large language models (LLMs) have been widely adopted as scalable surrogates for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases.
  • With the rise of reasoning-capable models, exposing a generator's reasoning content to the judge provides richer information and is a natural candidate for improving judgment accuracy.
Open paper
Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization

Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon, Md Rakibul Hasan · Mar 31, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 62% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Human Eval · Automatic Metrics · Medicine
  • Through controlled ablations on MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance.
  • ViTAS achieves SOTA results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores.
Open paper
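For readers triaging the BLEU-4 and ROUGE-L numbers above, both are standard n-gram overlap metrics between generated and reference report text. A minimal way to compute them with common packages (sacrebleu and rouge_score are assumed tooling; the paper's exact scoring setup is not shown here):

    # BLEU-4 and ROUGE-L between generated and reference radiology summaries.
    # sacrebleu and rouge_score are assumed dependencies; the strings are toy data.
    import sacrebleu
    from rouge_score import rouge_scorer

    hypotheses = ["small right pleural effusion without pneumothorax"]
    references = ["there is a small right pleural effusion and no pneumothorax"]

    bleu = sacrebleu.corpus_bleu(hypotheses, [references])   # 4-gram BLEU by default
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(references[0], hypotheses[0])["rougeL"].fmeasure

    print(f"BLEU-4: {bleu.score:.2f}  ROUGE-L: {100 * rouge_l:.2f}")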
Learning to Predict Future-Aligned Research Proposals with Language Models

Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu, Jiawei Han · Mar 28, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 62% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Human Eval · Automatic Metrics · Math · Coding
  • Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality.
  • Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method.
Open paper
Voxtral TTS

Mistral-AI: Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg, Andy Lo · Mar 26, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 62% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Human Eval · Automatic Metrics · Multilingual
  • In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5.
Open paper
Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

Jannis Vamvas, Ignacio Pérez Prat, Angela Heldstab, Dominic P. Fischer, Sina Ahmadi, Rico Sennrich · Mar 26, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 62% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Human Eval · Automatic Metrics · Multilingual
  • A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.
Open paper
When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech

Nicolás Benjamín Ocampo, Tommaso Caselli, Davide Ceolin · Mar 26, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 62% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Human Eval · Automatic Metrics · General
  • We validate it through extensive human evaluation, and show that our LLM-in-the-loop framework reduces human effort without compromising the annotation quality of the data.
Open paper
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

Xuanbo Su, Wenhao Hu, Haibo Su, Yunzhang Chen, Le Zhan, Yanqi Yang · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval).

Score: 58% · Sparse protocol signal · Freshness: Hot · Status: Fallback
Human Eval · Simulation Env · General
  • We introduce SalesLLM benchmark, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with…
  • We propose a fully automatic evaluation pipeline that combines (i) an LLM-based rater for sales-process progress, and (ii) fine-tuned BERT classifiers for end-of-dialogue buying intent.
Open paper
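The two-part pipeline described above (an LLM rater for sales-process progress plus a fine-tuned classifier for buying intent) can be sketched as two independent scorers combined per dialogue. Everything below is a hypothetical outline, including the rate_progress stub and the checkpoint path, not the SalesLLM implementation.

    # Hypothetical outline of a two-stage automatic dialogue evaluation pipeline:
    # (i) an LLM-based rater scores sales-process progress, and
    # (ii) a fine-tuned classifier predicts end-of-dialogue buying intent.
    from transformers import pipeline

    def rate_progress(dialogue: str) -> float:
        """Placeholder for an LLM judge returning a progress score in [0, 1]."""
        raise NotImplementedError("Call your LLM-based rater here.")

    # Placeholder checkpoint path; substitute a real fine-tuned intent classifier.
    intent_clf = pipeline("text-classification", model="path/to/finetuned-bert-intent")

    def evaluate_dialogue(dialogue: str) -> dict:
        intent = intent_clf(dialogue[-512:])[0]   # classify the end of the dialogue
        return {"progress": rate_progress(dialogue),
                "buy_intent": intent["label"],
                "intent_confidence": intent["score"]}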
