Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 731 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

Analysing Calls to Order in German Parliamentary Debates

Nina Smirnova, Daniel Dan, Philipp Mayr · Mar 27, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Law
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
GS-BrainText: A Multi-Site Brain Imaging Report Dataset from Generation Scotland for Clinical Natural Language Processing Development and Validation

Beatrice Alex, Claire Grover, Arlene Casey, Richard Tobin, Heather Whalley, William Whiteley · Mar 27, 2026

Citations: 0

Match reason: Title directly matches "elo".

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Medicine
  • Benchmark evaluation using EdIE-R, an existing rule-based NLP system developed in conjunction with the annotation schema, revealed some performance variation across health boards (F1: 86.13-98.13), phenotypes (F1: 22.22-100) and age groups…
Open paper
Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation

Lorcan McLaren, James Cross, Zuzanna Krakowska, Robin Rauner, Martijn Schoonvelde · Mar 27, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Most evaluations test a single model or configuration; how model choice, model size, learning approach, and prompt style interact, and whether popular "best practices" survive controlled comparison, are largely unexplored.
  • We present a controlled evaluation of these pipeline choices, testing six open-weight models across four political science annotation tasks under identical quantisation, hardware, and prompt-template conditions.
Open paper
KAT-Coder-V2 Technical Report

Fengxiang Li, Han Zhang, Haoyang Huang, Jinghui Wang, Jinhua Hao, Kun Yuan · Mar 29, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Sparse protocol signal Freshness: Hot Status: Ready
Coding
  • We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou.
  • KAT-Coder-V2 adopts a "Specialize-then-Unify" paradigm that decomposes agentic coding into five expert domains - SWE, WebCoding, Terminal, WebSearch, and General - each undergoing independent supervised fine-tuning and reinforcement…
Open paper
LongCat-Next: Lexicalizing Modalities as Discrete Tokens

Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng · Mar 29, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Sparse protocol signal Freshness: Hot Status: Ready
General
  • As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks.
Open paper
Improving Attributed Long-form Question Answering with Intent Awareness

Xinran Zhao, Aakanksha Naik, Jay DeYoung, Joseph Chee Chang, Jena D. Hwang, Tongshuang Wu · Mar 28, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper

Match reason: Title directly matches "elo".

Score: 80% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Existing datasets do not capture when adverse events (AEs) are officially recognized by regulatory authorities, preventing restriction of analyses to pre-confirmation periods and limiting evaluation of early detection performance.
  • Safety updates peaked around 2012.
Open paper
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning

Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner, Hongyuan Mei · Mar 29, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% High protocol signal Freshness: Hot Status: Ready
Expert Verification Human EvalAutomatic Metrics Multi Agent Medicine
  • In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded.
  • Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases.
Open paper
Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Multilingual
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding

Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben Wiesel, Shafiq Abedin · Mar 28, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Coding
  • To capture the full spectrum of chart comprehension, ChartNet additionally includes specialized subsets encompassing human annotated data, real-world data, safety, and grounding.
  • Fine-tuning on ChartNet consistently improves results across benchmarks, demonstrating its utility as large-scale supervision for multimodal models.
Open paper
Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents

Nicholas Edwards, Sebastian Schuster · Mar 27, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready
Multi Agent Coding
  • We propose an uncertainty-aware multi-agent scaffold that explicitly decouples underspecification detection from code execution.
  • Our results demonstrate that this multi-agent system using OpenHands + Claude Sonnet 4.5 achieves a 69.40% task resolve rate, significantly outperforming a standard single-agent setup (61.20%) and closing the performance gap with agents…
Open paper
PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian, Zuyan Liu · Mar 27, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning.
  • The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation.
Open paper
ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs

Inês Vieira, Inês Calvo, Iago Paulo, James Furtado, Rafael Ferreira, Diogo Tavares · Mar 27, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Sparse protocol signal Freshness: Hot Status: Ready
Llm As Judge Multilingual
  • European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR).
  • To address this, we introduce ALBA, a linguistically grounded benchmark designed from the ground up to assess LLM proficiency in linguistic-related tasks in pt-PT across eight linguistic dimensions, including Language Variety, Culture-bound…
Open paper
Learning to Commit: Generating Organic Pull Requests via Online Repository Memory

Mo Li, L. H. Xu, Qitai Tan, Ting Cao, Yunxin Liu · Mar 27, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Sparse protocol signal Freshness: Hot Status: Ready
Coding
  • Large language model (LLM)-based coding agents achieve impressive results on controlled benchmarks yet routinely produce pull requests that real maintainers reject.
  • Simply exposing an agent to the latest repository snapshot is not enough: the snapshot reveals the final state of the codebase, but not the repository-specific change patterns by which that state was reached.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.