Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 411 · Search mode: keyword

PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning

Bingxuan Li, Jeonghwan Kim, Cheng Qian, Xiusi Chen, Eitan Anzenberg, Niran Kundapur · Jan 17, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% · High protocol signal · Freshness: Warm · Status: Ready
Pairwise Preference · Automatic Metrics · Long Horizon · General
  • To enable a systematic study of this question, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution.
  • To address this gap, we propose PEARL, a reinforcement-learning framework that (i) augments the language agent with an external preference memory that stores and updates inferred strategies (e.g., attendee priorities, topic importance,…
Open paper
Task Arithmetic with Support Languages for Low-Resource ASR

Emma Rafkin, Dan DeGenaro, Xiulin Yang · Jan 11, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% · Moderate protocol signal · Freshness: Warm · Status: Ready
Automatic Metrics · General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation

Weiye Zhu, Zekai Zhang, Xiangchen Wang, Hewei Pan, Teng Wang, Tiantian Geng · Jan 26, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 77% · Sparse protocol signal · Freshness: Warm · Status: Ready
Long Horizon · General
  • Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and act coherently in visually rich environments.
  • Lacking awareness of how actions transform subsequent visual observations, agents cannot plan actions rationally, leading to unstable behaviors, weak generalization, and cumulative error along trajectory.
Open paper
Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% · Moderate protocol signal · Freshness: Warm · Status: Ready
Automatic Metrics · General
  • DGI achieves AUROC=0.958 on human-crafted confabulations with 3.8% cross-domain degradation.
  • External validation on three independently collected human-annotated benchmarks (WikiBio GPT-3, FELM, and ExpertQA) yields domain-specific AUROC 0.581-0.695, with DGI outperforming an NLI CrossEncoder baseline on expert-domain data, where…
Open paper

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% · High protocol signal · Freshness: Warm · Status: Ready
Rubric Rating · Automatic Metrics · General
  • Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for trained readers, who can over-trust surface well-formedness.
  • We present LREAD, a Korean-specific instantiation of a rubric-based expert-calibration framework for human attribution of LLM-generated text.
Open paper
Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Automatic Metrics · Tool Use · Multilingual
  • On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling.
Open paper
Meta-Cognitive Reinforcement Learning with Self-Doubt and Recovery

Zhipeng Zhang, Xiongfei Su, Kai Li · Jan 28, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 51% · Sparse protocol signal · Freshness: Warm · Status: Ready
General
  • In this work, we propose a meta-cognitive reinforcement learning framework that enables an agent to assess, regulate, and recover its learning behavior based on internally estimated reliability signals.
  • Experiments on continuous-control benchmarks with reward corruption demonstrate that recovery-enabled meta-cognitive control achieves higher average returns and significantly reduces late-stage training failures compared to strong…
Open paper
ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

Shir Ashury-Tahan, Yifan Mai, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen · Jan 22, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 51% · Sparse protocol signal · Freshness: Warm · Status: Ready
Coding
  • Large Language Model (LLM) benchmarks tell us when models fail, but not why they fail.
  • Without disentangling such causes, benchmarks remain incomplete and cannot reliably guide model improvement.
Open paper
Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation

Qianli Wang, Van Bach Nguyen, Yihong Liu, Fedor Splitt, Nils Feldhus, Christin Seifert · Jan 1, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 51% · Sparse protocol signal · Freshness: Warm · Status: Ready
Multilingual
  • We first conduct automatic evaluations on both directly generated counterfactuals in the target languages and those derived via English translation across six languages.
Open paper
Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% · Moderate protocol signal · Freshness: Warm · Status: Ready
Human Eval · Multilingual
  • Our results show that proprietary LLMs achieve near human-level APE quality even with simple one-shot prompting, regardless of whether document context is provided.
  • Furthermore, standard automatic metrics do not reliably reflect these qualitative improvements, highlighting the continued necessity of human evaluation.
Open paper
INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection

Shubham Kulkarni, Alexander Lyzhov, Preetam Joshi, Shiva Chaitanya · Jan 28, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Automatic Metrics · Web Browsing · Medicine
  • We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification.
  • All calls are annotated with a phase-structured JSON schema covering IVR navigation, patient identification, coverage status, medication checks (up to two drugs), and agent identification (CRN), and each phase is labeled for Information and…
Open paper
EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation

Pei Yang, Wanyi Chen, Ke Wang, Lynn Ai, Eric Yang, Tianyu Shi · Jan 10, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% · High protocol signal · Freshness: Warm · Status: Fallback
Automatic Metrics · Long Horizon · Coding
  • Existing evaluations often overlook execution accuracy and safety.
  • We introduce EVM-QuestBench, an execution-grounded benchmark for natural-language transaction-script generation on EVM-compatible chains.
Open paper
HAG: Hierarchical Demographic Tree-based Agent Generation for Topic-Adaptive Simulation

Rongxin Chen, Tianyu Wu, Bingbing Xu, Jiatang Luo, Xiucheng Xu, Huawei Shen · Jan 9, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 32% · Sparse protocol signal · Freshness: Warm · Status: Ready
Simulation Env · General
  • High-fidelity agent initialization is crucial for credible Agent-Based Modeling across diverse domains.
  • To address these problems, we propose HAG, a Hierarchical Agent Generation framework that formalizes population generation as a two-stage decision process.
Open paper
Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% · Sparse protocol signal · Freshness: Warm · Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Political Alignment in Large Language Models: A Multidimensional Audit of Psychometric Identity and Behavioral Bias

Adib Sakhawat, Tahsin Islam, Takia Farhin, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan · Jan 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% · Sparse protocol signal · Freshness: Warm · Status: Ready
Coding
  • These findings suggest that single-axis evaluations are insufficient and that multidimensional auditing frameworks are important to characterize alignment behavior in deployed LLMs.
Open paper
