Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 18 · Search mode: keyword


When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou, Hanrong Zhang · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 65% High protocol signal Freshness: Hot Status: Ready
Critique Edit Simulation Env Long Horizon Coding
  • As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirements or revising goals, during mid-task execution…
  • In this paper, we present the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes.
Open paper
The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle

Lara Russell-Lasalandra, Hudson Golino, Luis Eduardo Garrido, Alexander P. Christensen · Mar 30, 2026

Citations: 0

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready
Critique Edit Tool Use Coding
  • Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin.
  • The `AIGENIE` R package implements the AI-GENIE framework (Automatic Item Generation with Network-Integrated Evaluation), which integrates large language model (LLM) text generation with network psychometric methods to automate the early…
Open paper
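The AIGENIE card above pairs LLM item generation with network-psychometric filtering. As a rough sketch of the redundancy-pruning idea only (not the actual AIGENIE implementation, which ships as an R package; all names below are hypothetical), near-duplicate candidate items can be dropped by bag-of-words cosine similarity:

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a)  # missing keys count as 0
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def prune_redundant(items: list[str], threshold: float = 0.8) -> list[str]:
    """Greedily keep items whose similarity to every kept item stays below threshold."""
    kept: list[str] = []
    vecs: list[Counter] = []
    for item in items:
        v = Counter(item.lower().split())
        if all(cosine(v, kv) < threshold for kv in vecs):
            kept.append(item)
            vecs.append(v)
    return kept

items = [
    "I often feel anxious in social situations",
    "I often feel anxious in social settings",   # near-duplicate, gets pruned
    "I enjoy meeting new people",
]
print(prune_redundant(items))
```

A real pipeline would use embedding similarity and network loadings rather than token overlap, and the 0.8 threshold here is arbitrary.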

Score: 58% High protocol signal Freshness: Warm Status: Ready
Critique Edit Automatic Metrics Coding
  • This paper introduces ContentBench, a public benchmark suite that helps answer whether low-cost LLMs can replace human coders by tracking how much agreement they achieve, and at what cost, on the same interpretive coding tasks.
  • The suite uses versioned tracks that invite researchers to contribute new benchmark datasets.
Open paper
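The agreement tracking described above can be illustrated with Cohen's kappa, a standard chance-corrected agreement statistic. This is a generic sketch, not ContentBench's actual scoring code:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences of equal length."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Expected agreement under independent labeling with each rater's marginals.
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

human = ["pos", "neg", "pos", "neu", "neg", "pos"]
model = ["pos", "neg", "neu", "neu", "neg", "pos"]
print(round(cohens_kappa(human, model), 3))  # 0.75
```

Kappa above 0.6 is often read as substantial agreement, though any replace-the-human decision would also weigh per-category disagreement and cost.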
IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR

Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun, Prayag Tiwari · Jan 23, 2026

Citations: 0

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Expert Verification Human Eval Coding
  • Peer review relies on substantive, evidence-based questions, yet current LLMs generate surface-level queries that perform worse than human reviewer questions in expert evaluation.
  • To address this gap, we curate a high-quality dataset of reviewer questions from OpenReview and conduct a human preference study where expert annotators evaluate question-paper pairs across three dimensions: effort, evidence, and grounding.
Open paper
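Pairwise expert judgments like the ones described are commonly aggregated with a Bradley-Terry model. The sketch below is a generic illustration (not the paper's pipeline) that fits item strengths from (winner, loser) pairs with minorization-maximization updates:

```python
def bradley_terry(pairs, n_items, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs
    via minorization-maximization updates."""
    wins = [0] * n_items
    for w, _ in pairs:
        wins[w] += 1
    p = [1.0] * n_items
    for _ in range(iters):
        new = []
        for i in range(n_items):
            # Sum 1 / (p_i + p_j) over every comparison involving item i.
            denom = sum(1.0 / (p[w] + p[l]) for w, l in pairs if i in (w, l))
            new.append(wins[i] / denom if denom else p[i])
        total = sum(new)
        p = [x * n_items / total for x in new]  # normalize so strengths sum to n_items
    return p

# Item 0 beats item 1 twice and item 2 once; item 1 beats item 2 once.
strengths = bradley_terry([(0, 1), (0, 1), (0, 2), (1, 2)], n_items=3)
```

Note the degenerate case: an item with zero wins is driven to strength 0 by these updates; practical fits add a prior or regularization.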

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback
Critique Edit Coding
  • While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy.
  • This observation helps explain why the agent achieves near-perfect superficial syntactic alignment yet fails to detect or resolve deeper semantic errors.
Open paper
Citations: 0

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback
Critique Edit Coding
  • Agentic AI shifts the investor's role from analytical execution to oversight.
  • We present an agentic strategic asset allocation pipeline in which approximately 50 specialized agents produce capital market assumptions, construct portfolios using over 20 competing methods, and critique and vote on each other's output.
Open paper
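The critique-and-vote step described above can be illustrated with a plain majority vote over agent ballots. This is a toy sketch, not the paper's pipeline; the agent and method names are invented:

```python
from collections import Counter

def aggregate_votes(votes: dict[str, str]) -> str:
    """Return the proposal with the most agent votes (ties broken alphabetically)."""
    tally = Counter(votes.values())
    return max(sorted(tally), key=lambda k: tally[k])

votes = {
    "macro_agent": "risk_parity",
    "equity_agent": "mean_variance",
    "rates_agent": "risk_parity",
    "fx_agent": "risk_parity",
}
print(aggregate_votes(votes))  # risk_parity
```

With ~50 agents and 20+ competing methods, real systems typically weight votes by agent track record rather than counting them equally.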
Unlocking Reasoning Capability on Machine Translation in Large Language Models

Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio, Tom Kocmi · Feb 16, 2026

Citations: 0

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Critique Edit Long Horizon Math Coding
  • We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.
Open paper
Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026

Citations: 0

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback
Rubric Rating Critique Edit Coding
  • Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
  • We address this measurement gap with a tri-tier evaluation framework grounded in art-theoretical constructs (Section 2).
Open paper
MARS: toward more efficient multi-agent collaboration for LLM reasoning

Xiao Wang, Jia Wang, Yijie Wang, Pengtao Dang, Sha Cao, Chi Zhang · Sep 24, 2025

Citations: 0

Score: 53% High protocol signal Freshness: Cold Status: Ready
Critique Edit Automatic Metrics Multi Agent Coding
  • Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents.
  • In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the review process.
Open paper
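The review-process analogy behind MARS can be made concrete as a minimal author/reviewer loop. This is a generic sketch with toy stand-ins for the LLM agents, not the actual MARS roles or prompts:

```python
from typing import Callable

def review_loop(draft: str,
                revise: Callable[[str, str], str],
                critique: Callable[[str], str],
                rounds: int = 2) -> str:
    """Alternate critique and revision for a fixed number of rounds."""
    for _ in range(rounds):
        feedback = critique(draft)
        draft = revise(draft, feedback)
    return draft

# Toy stand-ins for LLM reviewer/author agents:
def critique(d: str) -> str:
    return "too short" if len(d) < 20 else "ok"

def revise(d: str, feedback: str) -> str:
    return d + " (expanded)" if feedback == "too short" else d

print(review_loop("Initial answer.", revise, critique))
```

In a role-based system each function would be a separately prompted model, and the loop would usually stop early once the reviewer signals acceptance.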
MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision

Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Ryan Chin, Caiming Xiong · May 21, 2025

Citations: 0

Score: 53% High protocol signal Freshness: Cold Status: Ready
Critique Edit Automatic Metrics Multi Agent Math Coding
  • Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks.
  • It achieves substantial average accuracy improvements of up to 16.69% on reasoning, 16.66% on coding, and 5.45% on agentic tasks, while maintaining cost efficiency.
Open paper
SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement

Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary · Mar 6, 2026

Citations: 0

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Critique Edit Math Coding
  • We introduce SAHOO, a practical framework to monitor and control drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii)…
Open paper
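The paper describes the Goal Drift Index as a learned multi-signal detector. As a purely illustrative stand-in (fixed hand-picked weights, invented channel names and scores; not the actual GDI), the combination step might look like:

```python
def goal_drift_index(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-channel drift scores, each assumed to lie in [0, 1]."""
    total_w = sum(weights.values())
    return sum(weights[k] * signals[k] for k in weights) / total_w

# Hypothetical per-channel drift scores for one self-improvement step.
signals = {"semantic": 0.4, "lexical": 0.1, "structural": 0.2, "distributional": 0.3}
weights = {"semantic": 2.0, "lexical": 1.0, "structural": 1.0, "distributional": 1.0}
gdi = goal_drift_index(signals, weights)
print(round(gdi, 2))  # 0.28
```

A learned detector would fit these weights (or a nonlinear combiner) from labeled drift episodes instead of hard-coding them.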
Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang, Xiaocheng Feng · Mar 4, 2026

Citations: 0

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Critique Edit Coding
  • Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, with 2.2× improvements in sample efficiency compared to RL methods trained solely on scalar…
Open paper

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Critique Edit Coding
  • NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code.
  • All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol.
Open paper
Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning

Chi Ruan, Dongfu Jiang, Yubo Wang, Wenhu Chen · Sep 26, 2025

Citations: 0

Score: 50% Moderate protocol signal Freshness: Cold Status: Fallback
Critique Edit Coding
  • We fine-tune multiple models (Critique-Coder) and evaluate them on different benchmarks to show their advantages over RL-only models.
  • We show that Critique-Coder consistently outperforms RL-only baselines on all the evaluated benchmarks.
Open paper
