
Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 880 · Search mode: keyword

Featured Papers

Popular high-signal papers with direct links to full protocol pages.



DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment

James Wedgwood, Aashiq Muhamed, Mona T. Diab, Virginia Smith · Mar 23, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% · High protocol signal · Freshness: Hot · Status: Ready
Tags: Pairwise Preference · Automatic Metrics · General
  • Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility.
  • We propose Dynamic SAE Steering for Preference Alignment (DSPA), an inference-time method that makes sparse autoencoder (SAE) steering prompt-conditional.
Open paper
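To make the DSPA idea concrete, here is a minimal sketch of what prompt-conditional SAE steering could look like, assuming a single decoder direction and a sigmoid gate computed from pooled prompt activations. All names (`W_dec`, `feature_idx`) and the gating rule are illustrative assumptions, not the paper's actual mechanism.

```python
import numpy as np

# Minimal sketch of prompt-conditional SAE steering (hypothetical;
# DSPA's actual gating and feature selection may differ).
rng = np.random.default_rng(0)
d_model, d_sae = 64, 256

W_dec = rng.standard_normal((d_sae, d_model))  # SAE decoder: features -> residual stream
feature_idx = 17                               # SAE feature assumed to encode the target preference
direction = W_dec[feature_idx]
direction = direction / np.linalg.norm(direction)  # unit steering direction

def steer(hidden, prompt_hidden, base_scale=4.0):
    """Add the feature direction, scaled by a prompt-dependent gate."""
    # Gate on how strongly the prompt already expresses the feature:
    # high cosine -> little extra steering, low cosine -> more steering.
    cos = prompt_hidden @ direction / (np.linalg.norm(prompt_hidden) + 1e-8)
    gate = 1.0 - 1.0 / (1.0 + np.exp(-5.0 * cos))  # sigmoid gate (assumed form)
    return hidden + base_scale * gate * direction

hidden = rng.standard_normal(d_model)         # residual-stream activation at some layer
prompt_hidden = rng.standard_normal(d_model)  # pooled prompt activation
print(steer(hidden, prompt_hidden).shape)     # (64,)
```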
SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model

Guifeng Deng, Pan Wang, Jiquan Wang, Shuying Rao, Junyi Xie, Wanjun Guo · Mar 22, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% · High protocol signal · Freshness: Hot · Status: Ready
Tags: Expert Verification · Automatic Metrics · Medicine
  • Expert evaluations further validated the quality of the model's reasoning, with mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence.
Open paper
Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% · High protocol signal · Freshness: Hot · Status: Ready
Tags: Critique Edit · Automatic Metrics · General
  • We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior.
  • To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning.
Open paper
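The AEP bullet is direct enough to sketch. Assuming it is a plain revised-label fraction (the paper may weight or normalize differently):

```python
def annotator_effort_proxy(labels_before, labels_after):
    """Proportion of labels revised after annotators see model reasoning.

    One plausible reading of AEP from the abstract; the exact
    definition in the paper may differ.
    """
    assert len(labels_before) == len(labels_after)
    revised = sum(b != a for b, a in zip(labels_before, labels_after))
    return revised / len(labels_before)

# Example: 2 of 5 sentiment labels revised after exposure to reasoning -> 0.4
print(annotator_effort_proxy(["pos", "neg", "neg", "pos", "neu"],
                             ["pos", "pos", "neg", "neg", "neu"]))
```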
PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs

Tianyi Huang, Caden Yang, Emily Yin, Eric Wang, Michael Zhang · Mar 21, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% · High protocol signal · Freshness: Hot · Status: Ready
Tags: Critique Edit · Automatic Metrics · Math
  • In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark.
Open paper
ReViSQL: Achieving Human-Level Text-to-SQL

Yuxuan Zhu, Tengjun Jin, Yoojin Choi, Daniel Kang · Mar 20, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% · Moderate protocol signal · Freshness: Hot · Status: Ready
Tags: Automatic Metrics · General
  • However, despite these extensive architectural engineering efforts, a significant gap remains: even state-of-the-art (SOTA) AI agents have not yet achieved human-level accuracy on the BIRD benchmark.
  • We introduce ReViSQL, a streamlined framework that achieves human-level accuracy on BIRD for the first time.
Open paper
What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Gagan Bhatia, Ahmad Muhammad Isa, Maxime Peyrard, Wei Zhao · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% · Moderate protocol signal · Freshness: Hot · Status: Ready
Tags: Automatic Metrics · Coding · Multilingual
  • We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and…
  • We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations.
Open paper
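The abstract names mDFR but gives no formula, so the sketch below is only a guess at the metric's flavor: excess tokens per date relative to one token per semantic component (day, month, year). The toy tokenizer and the normalization are both assumptions.

```python
import re

def date_fragmentation_ratio(dates, tokenize, n_components=3):
    """Illustrative fragmentation score (not the paper's mDFR definition):
    average excess tokens per date, relative to one token per component."""
    scores = []
    for d in dates:
        n_tokens = len(tokenize(d))
        scores.append(max(0, n_tokens - n_components) / n_components)
    return sum(scores) / len(scores)

def toy_tokenize(s):
    # Stand-in for a real subword tokenizer, for demonstration only.
    return re.findall(r"\d+|\w+|\S", s)

# "2026-03-12" fragments into 5 pieces (2026 / - / 03 / - / 12) -> nonzero score
print(date_fragmentation_ratio(["12 March 2026", "2026-03-12"], toy_tokenize))
```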
Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval

Hangeol Chang, Changsun Lee, Seungjoon Rho, Junho Yeo, Jong Chul Ye · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% · Moderate protocol signal · Freshness: Hot · Status: Ready
Tags: Automatic Metrics · Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper

Match reason: Title directly matches "accuracy".

Score: 90% · Moderate protocol signal · Freshness: Hot · Status: Ready
Tags: Automatic Metrics · General
  • Yet, evaluation practices focus primarily on model accuracy rather than whether human-AI teams are prepared to collaborate safely and effectively.
  • We introduce a four-part taxonomy of evaluation metrics spanning outcomes, reliance behavior, safety signals, and learning over time, and connect these metrics to the Understand-Control-Improve (U-C-I) lifecycle of human-AI onboarding and…
Open paper
Memento-Skills: Let Agents Design Agents

Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% · Moderate protocol signal · Freshness: Hot · Status: Ready
Tags: Automatic Metrics · Coding
  • We introduce Memento-Skills, a generalist, continually-learnable LLM agent system that functions as an agent-designing agent: it autonomously constructs, adapts, and improves task-specific agents through experience.
  • Experiments on the General AI Assistants benchmark and Humanity's Last Exam demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively.
Open paper
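For quick triage it helps to unpack "relative improvement": presumably (new - baseline) / baseline, so a 116.2% relative gain means roughly 2.16x the baseline accuracy. A worked check with a hypothetical 10% baseline:

```python
def relative_improvement(baseline, new):
    # Relative (not absolute) gain: the assumed reading of the abstract's numbers.
    return (new - baseline) / baseline

# Hypothetical baseline of 10% rising to 21.62% -> 116.2% relative improvement
print(f"{relative_improvement(0.10, 0.2162):.1%}")
```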

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% · Moderate protocol signal · Freshness: Hot · Status: Ready
Tags: Automatic Metrics · General
  • As large language models are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime.
  • We show benchmark accuracy does not predict governability, correction capacity varies independently of detection, and identical governance scaffolds produce opposite effects across models.
Open paper
TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols

Saketh Vinjamuri, Marielle Fis Loperena, Marie C. Spezia, Ramez Kouzy · Mar 22, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% · Moderate protocol signal · Freshness: Hot · Status: Ready
Tags: Automatic Metrics · Medicine
  • Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.
Open paper
Problems with Chinchilla Approach 2: Systematic Biases in IsoFLOP Parabola Fits

Eric Czech, Zhiwei Xu, Yael Elmatad, Yixin Wang, William Held · Mar 21, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% · Moderate protocol signal · Freshness: Hot · Status: Ready
Tags: Automatic Metrics · Law
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
JUBAKU: An Adversarial Benchmark for Exposing Culturally Grounded Stereotypes in Japanese LLMs

Taihei Shiotani, Masahiro Kaneko, Ayana Niwa, Yuki Maruyama, Daisuke Oba, Masanari Ohi · Mar 21, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% · Moderate protocol signal · Freshness: Hot · Status: Ready
Tags: Automatic Metrics · Multilingual
  • To address this limitation, we introduce Japanese cUlture adversarial BiAs benchmarK Under handcrafted creation (JUBAKU), a benchmark tailored to Japanese cultural contexts.
  • All models clearly exhibited biases on JUBAKU, performing below the random baseline of 50% with an average accuracy of 23% (ranging from 13% to 33%), despite higher accuracy on the other benchmarks.
Open paper
Geography According to ChatGPT -- How Generative AI Represents and Reasons about Geography

Krzysztof Janowicz, Gengchen Mai, Rui Zhu, Song Gao, Zhangyu Wang, Yingjie Hu · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% · Moderate protocol signal · Freshness: Hot · Status: Ready
Tags: Automatic Metrics · General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Through the Looking-Glass: AI-Mediated Video Communication Reduces Interpersonal Trust and Confidence in Judgments

Nelson Navajas Fernández, Jeffrey T. Hancock, Maurice Jakesch · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% · Moderate protocol signal · Freshness: Hot · Status: Ready
Tags: Automatic Metrics · General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment

Mohamed Sobhi Jabal, Jikai Zhang, Dominic LaBella, Jessica L. Houk, Dylan Zhang, Jeffrey D. Rudie · Mar 23, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Tags: Automatic Metrics · Multi Agent · Medicine
  • This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification.
  • The multi-agent LLM system achieved higher BT-RADS classification agreement with the expert reference standard compared to initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4…
Open paper
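Agreement with an expert reference standard and per-category positive predictive value are standard quantities, so a sketch is straightforward; the labels below are toy values, not data from the paper.

```python
def agreement(pred, ref):
    """Exact-match agreement with the expert reference standard."""
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)

def ppv(pred, ref, category="BT-4"):
    """Of all cases the system called `category`, the fraction the expert did too."""
    called = [(p, r) for p, r in zip(pred, ref) if p == category]
    return sum(p == r for p, r in called) / len(called) if called else float("nan")

ref  = ["BT-2", "BT-3a", "BT-4", "BT-4", "BT-1a"]   # expert reference (toy)
pred = ["BT-2", "BT-3a", "BT-4", "BT-3b", "BT-1a"]  # system output (toy)
print(agreement(pred, ref), ppv(pred, ref))          # 0.8 1.0
```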
Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% · High protocol signal · Freshness: Hot · Status: Fallback
Tags: Automatic Metrics · Long Horizon · Coding
  • Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous…
  • In empirical evaluation, we find that CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0).
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% · High protocol signal · Freshness: Hot · Status: Fallback
Tags: Automatic Metrics · Multi Agent · Law · Coding
  • LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods (paraphrase consistency, n-gram overlap, perplexity analysis) never directly…
  • We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a…
Open paper
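The CCV recipe is concrete enough to sketch: solve the same problem in N independent sessions and score solution diversity, where suspiciously low diversity on a hard problem signals leakage. The similarity measure below (difflib ratio) is an assumption; the abstract does not specify one.

```python
import itertools
from difflib import SequenceMatcher

def solution_diversity(solutions):
    """1 - mean pairwise similarity across N independent sessions.
    Near-zero diversity on a supposedly hard problem is a leakage signal."""
    pairs = list(itertools.combinations(solutions, 2))
    sims = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return 1.0 - sum(sims) / len(sims)

# Four sessions returning identical code -> diversity 0.0 (suspicious)
print(solution_diversity(["return x + 1"] * 4))
```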
RADIUS: Ranking, Distribution, and Significance - A Comprehensive Alignment Suite for Survey Simulation

Weronika Łajewska, Paul Missault, George Davidson, Saab Mansour · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Tags: Automatic Metrics · Simulation Env · General
  • Simulation of surveys using LLMs is emerging as a powerful application for generating human-like responses at scale.
  • In practice, a simulation can achieve high accuracy while still failing to capture the option most preferred by humans, a distinction that is critical in decision-making applications.
Open paper
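The accuracy-versus-preference gap is easy to demonstrate with a toy construction: 95% of simulated responses match their human counterparts, yet the simulated modal option differs from the human one. Pairing simulated with human respondents one-to-one is an assumption made for illustration.

```python
from collections import Counter

human = ["A"] * 40 + ["B"] * 35 + ["C"] * 25  # humans most prefer A
sim   = ["A"] * 35 + ["B"] * 40 + ["C"] * 25  # simulation most prefers B

acc = sum(h == s for h, s in zip(human, sim)) / len(human)  # 0.95
top_human = Counter(human).most_common(1)[0][0]             # 'A'
top_sim = Counter(sim).most_common(1)[0][0]                 # 'B'
print(acc, top_human, top_sim)  # high accuracy, wrong top option
```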
