Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 83 Search mode: keyword Ranking: eval-signal prioritized Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,620) General (530) Long Horizon (320) Pairwise Preference (288) Coding (218) Simulation Env (187) Multi Agent (182) Medicine (116) Llm As Judge (107) Expert Verification (97) Human Eval (89) Rubric Rating (82) Web Browsing (79) Math (77) Demonstrations (67) Critique Edit (63)

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

Document Reconstruction Unlocks Scalable Long-Context RLVR

Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin, Jung-jae Kim · Feb 9, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 93% High protocol signal Freshness: Warm Status: Ready

Rubric Rating Automatic Metrics Coding

However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming.
In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models' supervision.

Open paper

KLong: Training LLM Agent for Extremely Long-horizon Tasks

Yue Liu, Yingwei Ma, Yibo Miao, Yanhao Li, Yuchong Xie, Xinlong Yang · Feb 19, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 88% Moderate protocol signal Freshness: Warm Status: Ready

Rubric Rating Long Horizon Coding

Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics.
Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.

Open paper

Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins

Jasmin Han, Janardan Devkota, Joseph Waring, Amanda Luken, Felix Naughton, Roger Vilardaga · Feb 23, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 78% High protocol signal Freshness: Warm Status: Ready

Rubric Rating Automatic Metrics General

Open paper

Discovering Implicit Large Language Model Alignment Objectives

Edward Chen, Sanmi Koyejo, Carlos Guestrin · Feb 17, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Rubric Rating Human Eval General

To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives.
Experiments with popular open-source reward models show that the framework consistently captures > 90% of reward behavior, a finding further corroborated by human evaluation.

Open paper

Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation

Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise PreferenceRubric Rating Multi Agent General

Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
Across 50 rounds (250 paired monologues) judged by five expert annotators using A/B preference and a 15-item rubric, discussion wins 75.6% of instances and improves Craft/Clarity (Δ = 0.440) and Social Response (Δ = 0.422), with occasional…

Open paper

Small Reward Models via Backward Inference

Yike Wang, Faeze Brahman, Shangbin Feng, Teng Xiao, Hannaneh Hajishirzi, Yulia Tsvetkov · Feb 14, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Rubric Rating Llm As Judge Coding

However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility.
Evaluations across four domains using 13 small language models show that FLIP outperforms LLM-as-a-Judge baselines by an average of 79.6%.

Open paper

Confusion-Aware Rubric Optimization for LLM-based Automated Grading

Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Joseph Krajcik, Namsoo Shin · Feb 28, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 88% Moderate protocol signal Freshness: Warm Status: Fallback

Rubric Rating Automatic Metrics Medicine

Empirical evaluations on teacher education and STEM datasets demonstrate that CARO significantly outperforms existing SOTA methods.

Open paper

RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning

Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li, Zhen Qin · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 73% Moderate protocol signal Freshness: Warm Status: Ready

Rubric Rating Automatic Metrics General

Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.

Open paper

Query-focused and Memory-aware Reranker for Long Context Processing

Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Zheng Lin, Weiping Wang · Feb 12, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 73% Moderate protocol signal Freshness: Warm Status: Ready

Rubric Rating Automatic Metrics General

It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage.

Open paper

Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards

Yuanjie Lyu, Chengyu Wang, Lei Shen, Jun Huang, Tong Xu · Jan 30, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 73% Moderate protocol signal Freshness: Warm Status: Ready

Rubric Rating Simulation Env Tool Use Math

Small LLMs often struggle to match the agentic capabilities of large, costly models.
While reinforcement learning can help, progress has been limited by two structural bottlenecks: existing open-source agentic training data are narrow in task variety and easily solved; real-world APIs lack diversity and are unstable for…

Open paper

Decomposing Physician Disagreement in HealthBench

Satya Borgohain, Roy Mariathas · Feb 26, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 83% Moderate protocol signal Freshness: Warm Status: Fallback

Rubric Rating Medicine

We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it.
The agreement ceiling in medical AI evaluation is thus largely structural, but the reducible/irreducible dissociation suggests that closing information gaps in evaluation scenarios could lower disagreement where inherent clinical ambiguity…

Open paper

Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao, Mengyu Zhou · Feb 15, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 68% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise PreferenceRubric Rating Llm As Judge General

To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which…
To keep principles consistent yet editable across various domains, we introduce a two-level meta-rubric refinement pipeline (automated evolutionary refinement for general principles and a reproducible human-in-the-loop procedure for domain…

Open paper

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang · Feb 13, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 68% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise PreferenceRubric Rating Long Horizon Medicine

MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities.
For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision…

Open paper

LFQA-HP-1M: A Large-Scale Human Preference Dataset for Long-Form Question Answering

Rafid Ishrak Jahan, Fahmid Shahriar Iqbal, Sagnik Ray Choudhury · Feb 27, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: sparse protocol signal.

Score: 78% Sparse protocol signal Freshness: Warm Status: Fallback

Pairwise PreferenceRubric Rating General

We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA.
We propose nine rubrics for answer quality evaluation, and show that simple linear models based on these features perform comparably to state-of-the-art LLM evaluators.

Open paper

Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

Kensuke Okada, Yui Furukawa, Kyosuke Bunji · Feb 19, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: sparse protocol signal.

Score: 78% Sparse protocol signal Freshness: Warm Status: Fallback

Rubric Rating General

Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments.
We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs.

Open paper

The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage

Preni Golazizian, Elnaz Rahmati, Jackson Trager, Zhivar Sourati, Nona Ghazizadeh, Georgios Chochlakis · Feb 10, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: sparse protocol signal.

Score: 78% Sparse protocol signal Freshness: Warm Status: Fallback

Pairwise PreferenceRubric Rating Law

By sampling annotators from police-affiliated, justice-system-impacted, and non-affiliated Los Angeles residents, we enable the systematic study of perceptual differences across diverse communities.
To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for…

Open paper

BRIDGE the Gap: Mitigating Bias Amplification in Automated Scoring of English Language Learners via Inter-group Data Augmentation

Yun Wang, Xuansheng Wu, Jingyuan Huang, Lei Liu, Xiaoming Zhai, Ninghao Liu · Feb 27, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 68% Moderate protocol signal Freshness: Warm Status: Fallback

Rubric Rating General

Notably, our method achieves fairness gains comparable to using additional real human data, offering a cost-effective solution for ensuring equitable scoring in large-scale assessments.

Open paper

Optimizing In-Context Demonstrations for LLM-based Automated Grading

Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Kevin Haudek, Joseph Krajcik · Feb 28, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: sparse protocol signal.

Score: 63% Sparse protocol signal Freshness: Warm Status: Fallback

Rubric RatingDemonstrations General

GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.

Open paper

SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing

Yifei Xu, Guilherme Potje, Shivam Shandilya, Tiancheng Yuan, Leonardo de Oliveira Nunes, Rakshanda Agarwal · Feb 24, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: sparse protocol signal.

Score: 63% Sparse protocol signal Freshness: Warm Status: Fallback

Rubric RatingRed Team General

Open paper

The Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality for Audio Reasoning Models and Agents

Ziyang Ma, Ruiyang Xu, Yinghao Ma, Chao-Han Huck Yang, Bohan Li, Jaeyeon Kim · Feb 15, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: sparse protocol signal.

Score: 63% Sparse protocol signal Freshness: Warm Status: Fallback

Rubric Rating General

Featured Single Model and Agent tracks, the competition attracting 156 teams from 18 countries and regions.
Results show agent systems currently lead in reasoning quality, utilizing iterative tool orchestration and cross-modal analysis.

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent