
Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 38


Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Pairwise Preference · Rubric Rating · Human Eval · Automatic Metrics · General
  • Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
  • To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences.
Open paper
Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Rubric Rating · Automatic Metrics · General
  • As Large Language Model (LLM) capabilities advance, the demand for high-quality annotation of exponentially increasing text corpora has outpaced human capacity, leading to the widespread adoption of LLMs in automatic evaluation and…
  • However, proprietary LLMs often exhibit systematic biases that diverge from human expert consensus, lack reproducibility, and raise data privacy concerns.
Open paper
LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias

Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska · Mar 31, 2026

Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Rubric Rating · Human Eval · General
  • We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring.
  • Our results show that strong open-weight models achieve moderate to high agreement with humans on holistic scoring (Quadratic Weighted Kappa about 0.6), but this does not transfer uniformly to analytic scoring (a minimal QWK computation is sketched below).
Open paper
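To make the cited agreement metric concrete, here is a minimal Quadratic Weighted Kappa computation. The scores are hypothetical placeholders, not data from the paper; the metric call itself is standard scikit-learn.

```python
# Quadratic Weighted Kappa: chance-corrected agreement on an ordinal
# scale, penalizing disagreements by the squared distance between ratings.
from sklearn.metrics import cohen_kappa_score

# Hypothetical holistic essay scores on a 1-6 scale (not from the paper).
human_scores = [4, 3, 5, 2, 4, 6, 3, 5, 4, 2]
model_scores = [4, 3, 4, 2, 5, 6, 3, 4, 4, 3]

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.2f}")  # ~0.6 is the moderate-to-high range the paper reports
```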

Match reason: Matches selected tags (General, Rubric Rating).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Rubric Rating · Automatic Metrics · General
  • Atomic decomposition -- breaking a candidate answer into claims before verifying each against a reference -- is a widely adopted design for LLM-based reference-grounded judges (a skeleton of this design is sketched below).
  • Among perturbations examined, reference-quality degradation produced the largest accuracy drops for both judge families.
Open paper
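A skeleton of the atomic-decomposition design the entry describes, under loud assumptions: `call_llm`, the prompts, and the fraction-supported scoring rule are illustrative stand-ins, not the paper's protocol.

```python
# Sketch of an atomic-decomposition judge: split the candidate answer into
# claims, verify each claim against the reference, and score the answer as
# the fraction of supported claims. `call_llm` is a hypothetical stand-in.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def decompose(answer: str) -> list[str]:
    out = call_llm(
        "Break the following answer into short, self-contained factual "
        f"claims, one per line:\n\n{answer}"
    )
    return [line.strip() for line in out.splitlines() if line.strip()]

def verify(claim: str, reference: str) -> bool:
    verdict = call_llm(
        f"Reference:\n{reference}\n\nClaim:\n{claim}\n\n"
        "Is the claim supported by the reference? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

def judge(answer: str, reference: str) -> float:
    claims = decompose(answer)
    if not claims:
        return 0.0
    return sum(verify(c, reference) for c in claims) / len(claims)
```

The entry's headline finding fits this structure: every verification call is grounded in the reference, so degrading reference quality corrupts every claim-level verdict at once.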
Stabilizing Rubric Integration Training via Decoupled Advantage Normalization

Zelin Tan, Zhouliang Yu, Bohan Lin, Zijie Geng, Hejia Geng, Yudong Zhang · Mar 27, 2026

Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Rubric Rating · Automatic Metrics · General
  • We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward… (one reading of the decoupling is sketched below).
  • Experiments across multiple model scales and six benchmarks demonstrate that PAPO consistently outperforms ORM, reaching 51.3% vs. 46.3% on OlympiadBench while continuing to improve as ORM plateaus and declines.
Open paper
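One plausible reading of "decoupled advantage normalization" (an assumption based on the abstract, not the paper's released code): standardize outcome-level and process-level rewards separately within each GRPO group before summing, so neither signal's scale dominates the combined advantage.

```python
import numpy as np

def decoupled_advantages(outcome_r: np.ndarray, process_r: np.ndarray,
                         eps: float = 1e-8) -> np.ndarray:
    """Per-group GRPO advantages with each reward stream normalized on its own."""
    a_out = (outcome_r - outcome_r.mean()) / (outcome_r.std() + eps)
    a_proc = (process_r - process_r.mean()) / (process_r.std() + eps)
    return a_out + a_proc  # contrast: normalizing (outcome_r + process_r) jointly

# One group of four sampled rollouts (illustrative values, not the paper's).
outcome = np.array([1.0, 0.0, 1.0, 0.0])   # e.g. final-answer correctness
process = np.array([0.7, 0.6, 0.2, 0.4])   # e.g. rubric-scored reasoning steps
print(decoupled_advantages(outcome, process))
```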
When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools

Xingming Li, Runke Huang, Yanan Bao, Yuye Jin, Yuru Jiao, Qingyong Hu · Mar 25, 2026

Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Rubric Rating · Automatic Metrics · General
  • In this paper, we investigate whether AI can serve as a scalable assessment teammate by extracting structured quality indicators and validating their alignment with human expert judgments.
  • Our contributions include: (1) TEPE-TCI-370h (Tracing Effective Preschool Education), the first large-scale dataset of naturalistic teacher-child interactions in Chinese preschools (370 hours, 105 classrooms) with standardized ECQRS-EC and…
Open paper
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

Michael Krumdick, Varshini Reddy, Shivani Chaudhary, William Day, Maarij Ahmed, Hayan Haqqi · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 62% · Moderate protocol signal · Freshness: Hot · Status: Ready
Rubric Rating · Long Horizon · General
  • To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task to complete.
  • We demonstrate that our human experts both receive higher scores on average and are more likely to provide client-ready outputs than current state-of-the-art systems.
Open paper
MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao · Mar 30, 2026

Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 62% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Rubric Rating · General
  • Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs.
  • To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems.
Open paper
Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Rubric Rating · Simulation Env · Multi Agent · General
  • Large language models are increasingly proposed as autonomous agents for high-stakes public workflows, yet we lack systematic evidence about whether they would follow institutional rules when granted authority.
  • We evaluate multi-agent governance simulations in which agents occupy formal governmental roles under different authority structures, and we score rule-breaking and abuse outcomes with an independent rubric-based judge across 28,112…
Open paper

Match reason: Matches selected tags (General, Rubric Rating).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Rubric Rating · Automatic Metrics · General
  • Experiments on the multimodal EssayJudge dataset show that DLOM improves over a generation-based SFT baseline across scoring traits, and DLOM-GF yields further gains when modality relevance is heterogeneous.
  • On the text-only ASAP/ASAP++ benchmarks, DLOM remains effective without visual inputs, and DLOM-DA further improves performance and outperforms strong representative baselines.
Open paper
CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading

Pranav Raikote, Korbinian Randl, Ioanna Miliou, Athanasios Lakes, Panagiotis Papapetrou · Mar 12, 2026

Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Rubric Rating · Automatic Metrics · General
  • We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow (both calibration pieces are sketched below).
  • Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and…
Open paper
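A sketch of the two mechanisms the entry names, under stated assumptions: the grid-search temperature fitting and the 0.9 routing threshold are illustrative choices, not the paper's configuration.

```python
# Post-hoc temperature scaling fit on held-out grading logits, then
# confidence-based routing of uncertain answers to human graders.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature that minimizes NLL on a held-out set."""
    def nll(T):
        p = softmax(logits / T)
        return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return min(grid, key=nll)

def route(logits, T, threshold=0.9):
    """Automate confident grades; send the rest to human graders."""
    p = softmax(logits / T)
    conf = p.max(axis=1)
    auto = conf >= threshold
    return p.argmax(axis=1), auto  # predicted grade, automate-vs-human mask
```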
Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Rubric Rating · Human Eval · General
  • This paper investigates the application of state-of-the-art open-weight LLMs for the grading of Austrian A-level German texts, with a particular focus on rubric-based evaluation.
  • The LLMs were able to reach a maximum of 40.6% agreement with the human rater in the rubric-provided sub-dimensions, and only 32.8% of final grades matched the ones given by a human expert.
Open paper
Training data generation for context-dependent rubric-based short answer grading

Pavel Šindelář, Dávid Slivka, Christopher Bouma, Filip Prášil, Ondřej Bojar · Mar 30, 2026

Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 58% · Sparse protocol signal · Freshness: Hot · Status: Fallback
Rubric Rating · General
  • However, the need to avoid language differences and annotator bias makes the grading of student answers challenging.
Open paper
EvoIdeator: Evolving Scientific Ideas through Checklist-Grounded Reinforcement Learning

Andreas Sauter, Yuyue Zhao, Jacopo Urbani, Wenxiang Hu, Zaiqiao Meng, Lun Zhou · Mar 23, 2026

Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Ready
Rubric Rating · Critique Edit · LLM As Judge · General
  • EvoIdeator leverages a structured judge model to generate two synergistic signals: (1) lexicographic rewards for multi-dimensional optimization (sketched below), and (2) fine-grained language feedback that offers span-level critiques regarding grounding,…
Open paper
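A minimal illustration of a lexicographic reward over ranked dimensions (the dimension names and priority order are hypothetical, not EvoIdeator's): a candidate wins if and only if it scores higher on the highest-priority dimension where the two candidates differ, which Python tuple comparison implements directly.

```python
PRIORITY = ("grounding", "novelty", "feasibility")  # highest priority first

def reward_key(scores: dict[str, float]) -> tuple[float, ...]:
    """Order judge scores by priority so tuple comparison is lexicographic."""
    return tuple(scores[d] for d in PRIORITY)

idea_a = {"grounding": 0.9, "novelty": 0.4, "feasibility": 0.8}
idea_b = {"grounding": 0.9, "novelty": 0.7, "feasibility": 0.2}

# Equal on grounding, so novelty decides; feasibility never gets a vote.
print(reward_key(idea_b) > reward_key(idea_a))  # True
```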

Match reason: Matches selected tags (General, Rubric Rating).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Rubric Rating · Critique Edit · LLM As Judge · General
  • Through a large-scale study of 105,600 evaluation instances (32 LLMs × 3 frontier judges × 100 tasks × 11 temperatures), we show that model-level agreement (Spearman ρ = 0.99) masks fragile sample-level agreement (Pearson r = …); a toy illustration of this gap follows below.
  • Second, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment.
Open paper
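The masking effect is easy to reproduce on synthetic data (the numbers below are illustrative, not the paper's): when two judges agree on each model's average quality but add independent per-sample noise, model-level correlation looks near-perfect while sample-level correlation collapses.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

rng = np.random.default_rng(0)
n_models, n_tasks = 32, 100
model_quality = rng.normal(0, 2, size=(n_models, 1))        # strong model effect
judge_a = model_quality + rng.normal(0, 1, (n_models, n_tasks))
judge_b = model_quality + rng.normal(0, 1, (n_models, n_tasks))

# Model-level: correlate per-model means -- looks near-perfect.
rho, _ = spearmanr(judge_a.mean(axis=1), judge_b.mean(axis=1))
# Sample-level: correlate individual scores within one model -- much weaker.
r, _ = pearsonr(judge_a[0], judge_b[0])
print(f"model-level Spearman rho = {rho:.2f}, sample-level Pearson r = {r:.2f}")
```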
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright, Marcus Yearwood · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Pairwise Preference · Rubric Rating · LLM As Judge · Simulation Env · Long Horizon · General
  • Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
  • We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations.
Open paper
When Do Language Models Endorse Limitations on Human Rights Principles?

Keenan Samway, Nicole Miu Takagi, Rada Mihalcea, Bernhard Schölkopf, Ilias Chalkidis, Daniel Hershcovich · Mar 4, 2026

Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Pairwise Preference · Rubric Rating · General
  • As Large Language Models (LLMs) increasingly mediate global information access with the potential to shape public discourse, their alignment with universal human rights principles becomes important to ensure that these rights are abided by…
  • In this paper, we evaluate how LLMs navigate trade-offs involving the Universal Declaration of Human Rights (UDHR), leveraging 1,152 synthetically generated scenarios across 24 rights articles and eight languages.
Open paper
Optimizing In-Context Demonstrations for LLM-based Automated Grading

Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Kevin Haudek, Joseph Krajcik · Feb 28, 2026

Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Rubric Rating · Demonstrations · General
  • GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.
Open paper
LFQA-HP-1M: A Large-Scale Human Preference Dataset for Long-Form Question Answering

Rafid Ishrak Jahan, Fahmid Shahriar Iqbal, Sagnik Ray Choudhury · Feb 27, 2026

Citations: 0

Match reason: Matches selected tags (General, Rubric Rating).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Pairwise Preference · Rubric Rating · General
  • We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA.
  • We propose nine rubrics for answer quality evaluation, and show that simple linear models based on these features perform comparably to state-of-the-art LLM evaluators (a minimal version of such a baseline is sketched below).
Open paper
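A minimal version of that baseline under stated assumptions: the rubric scores and preference labels below are synthetic, and the feature choice (per-rubric score differences fed to a logistic regression) is one natural reading of "simple linear models based on these features."

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pairs, n_rubrics = 1000, 9            # the paper proposes nine rubrics
scores_a = rng.uniform(0, 1, (n_pairs, n_rubrics))
scores_b = rng.uniform(0, 1, (n_pairs, n_rubrics))
X = scores_a - scores_b                 # feature: per-rubric score gap
y = (X.sum(axis=1) + rng.normal(0, 0.5, n_pairs) > 0).astype(int)  # toy labels

clf = LogisticRegression().fit(X, y)    # the whole "evaluator" is linear
print("preference accuracy:", clf.score(X, y))
```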
