Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 22 · Search mode: keyword

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 62% · Moderate protocol signal · Freshness: Hot · Status: Ready
Pairwise Preference · Llm As Judge · Law
  • We propose RLearner-LLM with Hybrid-DPO: an automated preference pipeline that fuses a DeBERTa-v3 NLI signal with a verifier LLM score, removing human annotation while overcoming the "alignment tax" of single-signal optimization.
  • Our Qwen3-8B RLearner-LLM wins 95% of pairwise comparisons against its own SFT baseline; GPT-4o-mini in turn wins 95% against our concise output -- alongside the 69% win the same judge gives a verbose SFT over our DPO model, this replicates…
Open paper
HyperMem: Hypergraph Memory for Long-Term Conversations

Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang, Tingwen Liu · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Pairwise Preference · Llm As Judge · Automatic Metrics · General
  • Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues.
  • Experiments on the LoCoMo benchmark show that HyperMem achieves state-of-the-art performance with 92.73% LLM-as-a-judge accuracy, demonstrating the effectiveness of HyperMem for long-term conversations.
Open paper
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

José Pombal, Ricardo Rei, André F. T. Martins · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Pairwise Preference · Rubric Rating · Llm As Judge · Medicine
  • We present the first study of self-preference bias (SPB) in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings.
  • Using IFEval, a benchmark with programmatically verifiable rubrics, we show that SPB persists even when evaluation criteria are entirely objective: among rubrics where generators fail, judges can be up to 50% more likely to incorrectly…
Open paper
Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Pairwise Preference · Llm As Judge · Automatic Metrics · Medicine · Multilingual
  • A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
  • Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%).
Open paper
Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

Xin Sun, Di Wu, Sijing Qin, Isao Echizen, Abdallah El Ali, Saku Sugawara · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Ready
Pairwise Preference · Llm As Judge · General
  • Large language models (LLMs) are increasingly used as automated evaluators (LLM-as-a-Judge).
  • Using a counterfactual design, we find that both humans and LLM judges assign higher trust to information labeled as human-authored than to the same content labeled as AI-generated.
Open paper
Text-to-Stage: Spatial Layouts from Long-form Narratives

Jefferson Hernandez, Swarnadeep Saha, Chenxi Whitehouse, Sanjeel Parekh, Calvin Murdock, Yuliang Li · Mar 18, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Ready
Pairwise Preference · Llm As Judge · General
  • In this work, we probe the ability of a language model to demonstrate spatial reasoning from unstructured text, mimicking human capabilities and automating a process that benefits many downstream media applications.
  • We then introduce a dramaturgy-inspired deterministic evaluation suite and, finally, a training and inference recipe that combines rejection SFT using Best-of-N sampling with RL from verifiable rewards via GRPO.
Open paper
Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge

Junjie Wu, Xuan Kan, Zihao He, Shunwen Tan, Bo Pan, Kaitai Zhang · Mar 12, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Ready
Pairwise Preference · Llm As Judge · General
  • Multimodal Large Language Models (MLLMs) have been widely adopted as MLLM-as-a-Judges due to their strong alignment with human judgment across various visual tasks.
  • To address this limitation, we propose Multi-Task Reinforcement Learning for MLLM-as-a-Judge (MT-RL-Judge), a framework that jointly optimizes the judge model across multiple tasks, leveraging the generalization capabilities of RL.
Open paper
VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization

Weixin Liu, Congning Ni, Qingyuan Song, Susannah L. Rose, Christopher Symons, Murat Kantarcioglu · Mar 11, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Ready
Pairwise Preference · Llm As Judge · Medicine
  • We introduce VERI-DPO, which uses claim verification to mine preferences and distill them into the summarizer with Direct Preference Optimization (DPO).
  • On held-out patients, verifier-mined preferences separate candidates by contradiction density, and VERI-DPO reduces Not Supported claim rates from 10.7% to 1.9% (local verifier judge) and from 11.6% to 6.4% (GPT-4o judge), while improving…
Open paper

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Ready
Pairwise Preference · Llm As Judge · General
  • Checklists have emerged as a popular approach for interpretable and fine-grained evaluation, particularly with LLM-as-a-Judge.
  • To support these use cases, we present AutoChecklist, an open-source library that unifies checklist-based evaluation into composable pipelines.
Open paper
Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks

Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada · Mar 6, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Ready
Pairwise Preference · Expert Verification · Llm As Judge · General
  • This has prompted evaluation frameworks that use LLM-as-judge protocols and claim verification, along with meta-evaluation frameworks that seek to validate these methods.
  • We show that pairwise preference rankings are best suited for system-level evaluation, while explicit metric-wise annotations and expert annotators are critical for reliable metric-level assessment, with subjectivity remaining a key…
Open paper
Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao, Mengyu Zhou · Feb 15, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Ready
Pairwise Preference · Rubric Rating · Llm As Judge · General
  • To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which…
  • To keep principles consistent yet editable across various domains, we introduce a two-level meta-rubric refinement pipeline (automated evolutionary refinement for general principles and a reproducible human-in-the-loop procedure for domain…
Open paper
IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

Bosi Wen, Yilin Niu, Cunxiang Wang, Xiaoying Ling, Ying Zhang, Pei Ke · Mar 5, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Pairwise Preference · Llm As Judge · General
  • Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models.
  • To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types.
Open paper
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright, Marcus Yearwood · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Pairwise Preference · Rubric Rating · Llm As Judge · Simulation Env · Long Horizon · General
  • Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
  • We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations.
Open paper
HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong, Subhabrata Mukherjee · Jan 9, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 53% · Moderate protocol signal · Freshness: Cold · Status: Ready
Pairwise Preference · Rubric Rating · Human Eval · Llm As Judge · General
  • Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
  • We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations.
Open paper
Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 53% · Moderate protocol signal · Freshness: Cold · Status: Ready
Pairwise Preference · Llm As Judge · Coding
  • However, building a benchmark for LLM-generated web apps remains challenging due to the need for real-world user requirements, generalizable evaluation metrics without relying on ground-truth implementations or test cases, and interpretable…
  • To address these challenges, we introduce WebCoderBench, the first real-world-collected, generalizable, and interpretable benchmark for web app generation.
Open paper
EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation

Xinda Wang, Zhengxu Hou, Yangshijie Zhang, Bingren Yan, Jialin Liu, Chenzhuo Zhao · Aug 8, 2025

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 53% · Moderate protocol signal · Freshness: Cold · Status: Ready
Pairwise Preference · Llm As Judge · Multi Agent · General
  • Although the effectiveness of Large Language Models (LLMs) as judges (LLM-as-a-judge) has been validated, their performance remains limited in open-ended tasks, particularly in story evaluation.
  • Accurate story evaluation is crucial not only for assisting human quality judgment but also for providing key signals to guide story generation.
Open paper
No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding

Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, Chris Tanner · Mar 7, 2025

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 53% · High protocol signal · Freshness: Cold · Status: Ready
Pairwise Preference · Llm As Judge · General
  • To address this gap, we introduce the Business and Finance Fundamentals Benchmark (BFF-Bench), a dataset of 160 challenging questions and long-form responses authored by financial professionals.
  • We demonstrate that providing the judges with expert-written references largely mitigates this issue, highlighting the limits of using LLM-as-a-Judge without any form of human verification.
Open paper
Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics

Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag · Oct 24, 2025

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 50% · Moderate protocol signal · Freshness: Cold · Status: Ready
Pairwise Preference · Llm As Judge · Multilingual
  • Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and increasingly serve as selection criteria in data filtering and candidate reranking.
  • Through a systematic study of top-performing learned and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation…
Open paper
Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity

Arkadiy Saakyan, Najoung Kim, Smaranda Muresan, Tuhin Chakrabarty · Sep 26, 2025

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 50% · Moderate protocol signal · Freshness: Cold · Status: Ready
Pairwise Preference · Llm As Judge · General
  • We investigate the relationship between this notion of creativity and n-gram novelty through 8,618 expert writer annotations of novelty, pragmaticality, and sensicality via close reading of human- and AI-generated text.
  • We find that while n-gram novelty is positively associated with expert writer-judged creativity, approximately 91% of top-quartile n-gram novel expressions are not judged as creative, cautioning against relying on n-gram novelty alone.
Open paper
EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, Wenhu Chen · Sep 30, 2025

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Pairwise Preference).

Score: 53% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Pairwise Preference · Llm As Judge · General
  • To address this critical bottleneck, we built EditReward, trained with our new large-scale human preference dataset, meticulously annotated by trained experts following a rigorous protocol containing over 200K preference pairs.
  • EditReward demonstrates superior alignment with human preferences in instruction-guided image editing tasks.
Open paper
