
Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 63




How Much LLM Does a Self-Revising Agent Actually Need?

Sungwoo Jung, Seonil Son · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Critique Edit).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Critique Edit Automatic Metrics General
  • Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop.
  • We introduce a declared reflective runtime protocol that externalizes agent state, confidence signals, guarded actions, and hypothetical transitions into inspectable runtime structure.
Open paper
When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou, Hanrong Zhang · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Critique Edit).

Score: 65% High protocol signal Freshness: Hot Status: Ready
Critique Edit Simulation Env Long Horizon Coding
  • As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirements or revising goals, during mid-task execution…
  • In this paper, we present the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes.
Open paper
Citations: 0

Match reason: Matches selected tags (Critique Edit).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Critique Edit Automatic Metrics Medicine
  • Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile,…
  • In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track…
Open paper
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai, Zixian Huang · Mar 30, 2026

Citations: 0

Match reason: Matches selected tags (Critique Edit).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Critique Edit Long Horizon General
  • We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe.
  • On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness,…
Open paper
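The evaluation-driven evolutionary loop the Kernel-Smith summary describes — a population of executable candidates, an archive of top performers, and iterative improvement guided by execution feedback — can be sketched roughly as below. This is a minimal illustration of the general pattern, not the paper's actual API; every name here is an assumption.

```python
def evolve(seed_candidates, score, mutate, generations=5, archive_size=4):
    """Generic evolutionary loop sketch.

    score(candidate)  -> float, higher is better (stands in for structured
                         execution feedback on compilation/correctness/speed)
    mutate(candidate) -> new candidate (stands in for LLM-driven rewriting)
    """
    # Seed the archive with the best initial candidates.
    archive = sorted(seed_candidates, key=score, reverse=True)[:archive_size]
    for _ in range(generations):
        # Propose a mutation of each archived candidate.
        offspring = [mutate(parent) for parent in archive]
        # Re-rank the combined pool and keep only the top performers.
        archive = sorted(archive + offspring, key=score, reverse=True)[:archive_size]
    return archive[0]
```

In the real system the archive would also be curated for diversity, not just score, so that the mutation step does not collapse onto a single lineage.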

Match reason: Matches selected tags (Critique Edit).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Critique Edit Long Horizon Math
  • Recent reasoning-focused language models such as DeepSeek R1 and OpenAI o1 have demonstrated strong performance on structured reasoning benchmarks including GSM8K, MATH, and multi-hop question answering tasks.
  • To address this limitation, we introduce Retrieval-Augmented Self-Supervised Prompt Refinement (RASPRef), a framework that improves prompts without requiring human annotations or task-specific supervision.
Open paper
BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents

Praveen Kumar Myakala, Manan Agrawal, Rahul Manche · Mar 25, 2026

Citations: 0

Match reason: Matches selected tags (Critique Edit).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Critique Edit Automatic Metrics General
  • LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved.
  • We further introduce four novel evaluation metrics: Belief Revision Accuracy (BRA), Drift Coherence Score (DCS), Contradiction Resolution Rate (CRR), and Evidence Sensitivity Index (ESI).
Open paper
Citations: 0

Match reason: Matches selected tags (Critique Edit).

Score: 65% High protocol signal Freshness: Hot Status: Ready
Critique Edit Automatic Metrics General
  • We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior.
  • To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning.
Open paper
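Taking the card's description of the Annotator Effort Proxy literally — "the proportion of labels revised after exposure to reasoning" — a minimal reading can be sketched as follows. The function name and exact formulation are assumptions; the paper's definition may differ.

```python
def annotator_effort_proxy(labels_before, labels_after):
    """Hypothetical AEP sketch: fraction of items whose label changed
    after the annotator saw model reasoning."""
    if len(labels_before) != len(labels_after):
        raise ValueError("label lists must align item-by-item")
    revised = sum(b != a for b, a in zip(labels_before, labels_after))
    return revised / len(labels_before)
```

For example, if 2 of 5 sentiment labels change after exposure to reasoning, this reading of AEP gives 0.4.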
PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs

Tianyi Huang, Caden Yang, Emily Yin, Eric Wang, Michael Zhang · Mar 21, 2026

Citations: 0

Match reason: Matches selected tags (Critique Edit).

Score: 65% High protocol signal Freshness: Hot Status: Ready
Critique Edit Automatic Metrics Math
  • In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark.
Open paper
The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle

Lara Russell-Lasalandra, Hudson Golino, Luis Eduardo Garrido, Alexander P. Christensen · Mar 30, 2026

Citations: 0

Match reason: Matches selected tags (Critique Edit).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready
Critique Edit Tool Use Coding
  • Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin.
  • The `AIGENIE` R package implements the AI-GENIE framework (Automatic Item Generation with Network-Integrated Evaluation), which integrates large language model (LLM) text generation with network psychometric methods to automate the early…
Open paper
EvoIdeator: Evolving Scientific Ideas through Checklist-Grounded Reinforcement Learning

Andreas Sauter, Yuyue Zhao, Jacopo Urbani, Wenxiang Hu, Zaiqiao Meng, Lun Zhou · Mar 23, 2026

Citations: 0

Match reason: Matches selected tags (Critique Edit).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready
Rubric Rating Critique Edit Llm As Judge General
  • EvoIdeator leverages a structured judge model to generate two synergistic signals: (1) lexicographic rewards for multi-dimensional optimization, and (2) fine-grained language feedback that offers span-level critiques regarding grounding,…
Open paper
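The "lexicographic rewards for multi-dimensional optimization" mentioned above refer to an ordering in which earlier reward dimensions strictly dominate later ones: a candidate only benefits from dimension k if it ties on dimensions 0..k-1. In Python this is easy to illustrate, since tuple comparison is already lexicographic; the dimension names below are illustrative, not the paper's.

```python
def best_by_lexicographic_reward(candidates, reward_fn):
    """Select the candidate with the lexicographically greatest reward tuple.

    reward_fn(c) -> tuple of scores ordered by priority, e.g.
    (grounding, novelty, clarity). Python compares tuples element by
    element, so max() implements the lexicographic ordering directly.
    """
    return max(candidates, key=reward_fn)
```

With rewards a=(1, 9), b=(2, 0), c=(2, 1), candidate c wins: the first dimension eliminates a despite its high second score, and c beats b on the tiebreak.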
Adaptive Robust Estimator for Multi-Agent Reinforcement Learning

Zhongyi Li, Wan Tian, Jingyu Chen, Kangyao Huang, Huiming Zhang, Hui Yang · Mar 23, 2026

Citations: 0

Match reason: Matches selected tags (Critique Edit).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready
Critique Edit Multi Agent Math
  • Multi-agent collaboration has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models, yet it suffers from interaction-level ambiguity that blurs generation, critique, and revision, making credit…
  • To address both issues, we propose a robust multi-agent reinforcement learning framework for collaborative reasoning, consisting of two components: Dual-Agent Answer-Critique-Rewrite (DACR) and an Adaptive Robust Estimator (ARE).
Open paper

Match reason: Matches selected tags (Critique Edit).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback
Critique Edit Coding
  • While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy.
  • This observation helps explain why the agent achieves near-perfect superficial syntactic alignment yet fails to detect or resolve deeper semantic errors.
Open paper
Citations: 0

Match reason: Matches selected tags (Critique Edit).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback
Critique Edit Coding
  • Agentic AI shifts the investor's role from analytical execution to oversight.
  • We present an agentic strategic asset allocation pipeline in which approximately 50 specialized agents produce capital market assumptions, construct portfolios using over 20 competing methods, and critique and vote on each other's output.
Open paper
Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation

Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Ru Peng, Zenan Huang, Haokai Xu · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Critique Edit).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback
Rubric Rating Critique Edit Law
  • However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to…
Open paper
EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces

Léane Jourdan, Julien Aubert-Béduchaud, Yannis Chupin, Marah Baccari, Florian Boudin · Mar 30, 2026

Citations: 0

Match reason: Matches selected tags (Critique Edit).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback
Critique Edit General
  • This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing.
  • We additionally provide a human-annotated benchmark for revision detection.
Open paper
XSkill: Continual Learning from Experience and Skills in Multimodal Agents

Guanyu Jiang, Zhaochen Su, Xiaoye Qu, Yi R. Fung · Mar 12, 2026

Citations: 0

Match reason: Matches selected tags (Critique Edit).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Critique Edit Tool Use General
  • Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings.
  • To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents.
Open paper

Match reason: Matches selected tags (Critique Edit).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Rubric Rating Critique Edit Llm As Judge General
  • Through a large-scale study of 105,600 evaluation instances (32 LLMs × 3 frontier judges × 100 tasks × 11 temperatures), we show that model-level agreement (Spearman ρ = 0.99) masks fragile sample-level agreement (Pearson r =…
  • Second, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment.
Open paper
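The card above contrasts Spearman rank correlation on model-level aggregate scores with Pearson correlation on individual samples. For intuition, Spearman's ρ is simply Pearson's r computed on ranks, which a short pure-Python sketch makes concrete (this is a standard construction, not the paper's code; ties receive average ranks):

```python
def _ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over a run of tied values
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    # Spearman's rho = Pearson's r applied to the rank transforms.
    return pearson(_ranks(xs), _ranks(ys))
```

Because Spearman only sees the ordering of the aggregates, two judges can rank 32 models almost identically (ρ near 1) while disagreeing substantially on individual samples, which is exactly the fragility the study highlights.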
