Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 736 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

Who We Are, Where We Are: Mental Health at the Intersection of Person, Situation, and Large Language Models

Nikita Soni, August Håkan Nilsson, Syeda Mahwish, Vasudha Varadarajan, H. Andrew Schwartz, Ryan L. Boyd · Mar 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • These findings underscore the value of integrating computational modeling with psychological theory to assess dynamic mental states in contextually sensitive and human-understandable ways.
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Ready
Simulation Env General
  • We show how these models are becoming increasingly powerful in learning various aspects of speech without strong linguistic priors, and how many features of early language development can be explained through a shared set of learning…
Open paper
Evaluating LLM-Based Grant Proposal Review via Structured Perturbations

William Thorne, Joseph James, Yang Wang, Chenghua Lin, Diana Maynard · Mar 9, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Ready
Human Eval LawCoding
  • As AI-assisted grant proposals outpace manual review capacity in a kind of ``Malthusian trap'' for the research ecosystem, this paper investigates the capabilities and limitations of LLM-based grant reviewing for high-stakes evaluation.
  • Human evaluation shows LLM feedback is largely valid but skewed toward compliance checking over holistic assessment.
Open paper
SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions

Saroj Mishra, Suman Niroula, Umesh Yadav, Dilip Thakur, Srijan Gyawali, Shiva Gaire · Mar 7, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon General
  • Retrieval-Augmented Generation (RAG) systems are increasingly evolving into agentic architectures where large language models autonomously coordinate multi-step reasoning, dynamic memory management, and iterative retrieval strategies.
  • Despite rapid industrial adoption, current research lacks a systematic understanding of Agentic RAG as a sequential decision-making system, leading to highly fragmented architectures, inconsistent evaluation methodologies, and unresolved…
Open paper
LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models

Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Tri Nguyen, Vasudev Lal, Joseph Campbell · Mar 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Multi Agent General
  • Large Language Models (LLMs) exhibit impressive general-purpose capabilities but also introduce serious safety risks, particularly the potential for deception as models acquire increased agency and human oversight diminishes.
  • In this work, we present LieCraft: a novel evaluation framework and sandbox for measuring LLM deception that addresses key limitations of prior game-based evaluations.
Open paper
Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 73% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Contemporary artificial intelligence research has been organized around two dominant ambitions: productivity, which treats AI systems as tools for accelerating work and economic output, and alignment, which focuses on ensuring that…
  • This paper articulates and develops a third, emerging ambition: the use of large language models (LLMs) as scientific instruments for studying human behavior, culture, and moral reasoning.
Open paper
Enhancing Consistency of Werewolf AI through Dialogue Summarization and Persona Information

Yoshiki Tanaka, Takumasa Kaneko, Hiroki Onozeki, Natsumi Ezure, Ryuichi Uehara, Zhiyang Qi · Mar 7, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 73% Sparse protocol signal Freshness: Warm Status: Ready
General
  • In this study, we present a Werewolf AI agent developed for the AIWolfDial 2024 shared task, co-hosted with the 17th INLG.
  • By analyzing self-match game logs, we demonstrate that the agent's utterances are contextually consistent and that the character, including tone, is maintained throughout the game.
Open paper
DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model

Hao Yang, Hongbo Zhang, Yanyan Zhao, Bing Qin · Mar 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 73% Sparse protocol signal Freshness: Warm Status: Ready
General
  • To evaluate the performance of our model, we develop a comprehensive depth question answer benchmark based on existing depth image datasets, which rigorously assesses understanding in typical depth map scenarios.
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 73% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Human problem-solving is enriched by a diversity of styles and personality traits, yet the development of Large Language Models (LLMs) has largely prioritized uniform performance benchmarks that favour specific behavioural tendencies such…
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Fallback
Expert Verification General
  • A human expert develops a bilateral position (commitments and denials) about a topic through prover-skeptic dialogue with a large language model (LLM) opponent.
Open paper
CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

Mohammed Baharoon, Thibault Heintz, Siavash Raissi, Mahmoud Alabbad, Mona Alhammad, Hassan AlOmaish · Mar 6, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Medicine
  • We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety.
  • CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendalls tau = 0.61-0.71; Pearsons r = 0.71-0.84), and through two additional benchmarks that we…
Open paper
QuadAI at SemEval-2026 Task 3: Ensemble Learning of Hybrid RoBERTa and LLMs for Dimensional Aspect-Based Sentiment Analysis

A. J. W. de Vink, Filippos Karolos Ventirozos, Natalia Amat-Lefort, Lifeng Han · Mar 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready
Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
KCoEvo: A Knowledge Graph Augmented Framework for Evolutionary Code Generation

Jiazhen Kang, Yuchen Lu, Chen Jiang, Jinrui Liu, Tianhao Zhang, Bo Jiang · Mar 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Coding
  • Both modules are trained with synthetic supervision automatically derived from real-world API diffs, ensuring scalability and minimal human effort.
  • Extensive experiments across single-package and multi-package benchmarks demonstrate that our framework significantly improves migration accuracy, controllability, and execution success over standard LLM baselines.
Open paper
MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs

Abdessalam Bouchekif, Shahd Gaben, Samer Rashwani, Somaya Eltanbouly, Mutaz Al-Khatib, Heba Sbahi · Mar 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon Law
  • To evaluate models beyond final-answer accuracy, we propose MIR-E (Mawarith Inheritance Reasoning Evaluation), a weighted multi-stage metric that scores key reasoning stages and captures error propagation across the pipeline.
  • The MAWARITH dataset is publicly available at https://github.com/bouchekif/inheritance_evaluation.
Open paper
CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases

Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao, Yunfei Ma · Mar 9, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 32% Sparse protocol signal Freshness: Warm Status: Ready
General
  • However, existing evaluation methods often oversimplify instruction complexity as a mere additive combination of atomic constraints, failing to adequately capture the high-dimensional complexity arising from the intricate interplay of…
  • To bridge this gap, we introduce CCR-Bench, a novel benchmark designed to assess LLMs' adherence to complex instructions.
Open paper
AI Steerability 360: A Toolkit for Steering Large Language Models

Erik Miehling, Karthikeyan Natesan Ramamurthy, Praveen Venkateswaran, Irene Ko, Pierre Dognin, Moninder Singh · Mar 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Comprehensive evaluation and comparison of steering methods/pipelines is facilitated by use case classes (for defining tasks) and a benchmark class (for performance comparison on a given task).
Open paper
Image Generation Models: A Technical History

Rouzbeh Shirvani · Mar 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Gender Bias in MT for a Genderless Language: New Benchmarks for Basque

Amaia Murillo, Olatz-Perez-de-Viñaspre, Naiara Perez · Mar 9, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 32% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Multilingual
  • WinoMTeus adapts the WinoMT benchmark to examine how gender-neutral Basque occupations are translated into gendered languages such as Spanish and French.
  • FLORES+Gender, in turn, extends the FLORES+ benchmark to assess whether translation quality varies when translating from gendered languages (Spanish and English) into Basque depending on the gender of the referent.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.