Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 18 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,634) General (532) Long Horizon (320) Pairwise Preference (289) Coding (221) Simulation Env (190) Multi Agent (184) Medicine (117) Llm As Judge (109) Expert Verification (98) Human Eval (89) Rubric Rating (82) Web Browsing (79) Math (78) Demonstrations (67) Critique Edit (63)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Apr 16, 2026 · Citations: 0

We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Latent-Condensed Transformer for Efficient Long Context Modeling
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models
Apr 13, 2026 · Citations: 0

Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
Apr 13, 2026 · Citations: 0

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
Apr 13, 2026 · Citations: 0

Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies.
Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation
Apr 13, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Apr 13, 2026 · Citations: 0

We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry
Apr 11, 2026 · Citations: 0

To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human-written poems and 20,388 poems are generated by four popular LLMs.
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
Apr 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
Apr 11, 2026 · Citations: 0

To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression--ranking objective to provide fine-grained evaluation of reasoning paths.
MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits
Apr 10, 2026 · Citations: 0

On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou, Hanrong Zhang · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 65% High protocol signal Freshness: Hot Status: Ready

Critique Edit Simulation Env Long Horizon Coding

As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirement or revising goals, during mid-task execution…
In this paper, we present the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes.

Open paper

The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle

Lara Russell-Lasalandra, Hudson Golino, Luis Eduardo Garrido, Alexander P. Christensen · Mar 30, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready

Critique Edit Tool Use Coding

Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin.
The `AIGENIE` R package implements the AI-GENIE framework (Automatic Item Generation with Network-Integrated Evaluation), which integrates large language model (LLM) text generation with network psychometric methods to automate the early…

Open paper

Can Large Language Models Replace Human Coders? Introducing ContentBench

Michael Haman · Feb 23, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 58% High protocol signal Freshness: Warm Status: Ready

Critique Edit Automatic Metrics Coding

This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks.
The suite uses versioned tracks that invite researchers to contribute new benchmark datasets.

Open paper

From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design

Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen · Feb 14, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready

Critique Edit Simulation Env Coding

Open paper

IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR

Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun, Prayag Tiwari · Jan 23, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise PreferenceExpert Verification Human Eval Coding

Peer review relies on substantive, evidence-based questions, yet current LLMs generate surface-level queries that perform worse than human reviewer questions in expert evaluation.
To address this gap, we curate a high-quality dataset of reviewer questions from OpenReview and conduct a human preference study where expert annotators evaluate question-paper pairs across three dimensions: effort, evidence, and grounding.

Open paper

From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection

Hongxu Zhou · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback

Critique Edit Coding

While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy.
This observation helps explain why the agent achieves near-perfect superficial syntactic alignment yet fails to detect or resolve deeper semantic errors.

Open paper

The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management

Andrew Ang, Nazym Azimbayev, Andrey Kim · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback

Critique Edit Coding

Agentic AI shifts the investor's role from analytical execution to oversight.
We present an agentic strategic asset allocation pipeline in which approximately 50 specialized agents produce capital market assumptions, construct portfolios using over 20 competing methods, and critique and vote on each other's output.

Open paper

Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines

Jingjie Ning, Xueqi Li, Chengyu Yu · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback

Critique Edit Coding

We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive MCQ and competitive programming.

Open paper

Understanding Teacher Revisions of Large Language Model-Generated Feedback

Conrad Borchers, Luiz Rodrigues, Newarney Torrezão da Costa, Cleon Xavier, Rafael Ferreira Mello · Mar 29, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback

Critique EditRlaif Or Synthetic Feedback Coding

Open paper

Unlocking Reasoning Capability on Machine Translation in Large Language Models

Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio, Tom Kocmi · Feb 16, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready

Critique Edit Long Horizon MathCoding

We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.

Open paper

Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback

Rubric RatingCritique Edit Coding

Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
We address this measurement gap with a tri-tier evaluation framework grounded in art-theoretical constructs (Section 2).

Open paper

MARS: toward more efficient multi-agent collaboration for LLM reasoning

Xiao Wang, Jia Wang, Yijie Wang, Pengtao Dang, Sha Cao, Chi Zhang · Sep 24, 2025

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 53% High protocol signal Freshness: Cold Status: Ready

Critique Edit Automatic Metrics Multi Agent Coding

Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents.
In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the review process.

Open paper

MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision

Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Ryan Chin, Caiming Xiong · May 21, 2025

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 53% High protocol signal Freshness: Cold Status: Ready

Critique Edit Automatic Metrics Multi Agent MathCoding

Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks.
It achieves substantial average accuracy improvements of up to 16.69% on reasoning, 16.66% on coding, and 5.45% on agentic tasks, while maintaining cost efficiency.

Open paper

SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement

Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary · Mar 6, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Critique Edit MathCoding

We introduce SAHOO, a practical framework to monitor and control drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii)…

Open paper

Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang, Xiaocheng Feng · Mar 4, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Critique Edit Coding

Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, achieving 2.2\times improvements in sample efficiency compared to RL methods trained solely on scalar…

Open paper

Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift

Hyunwoo Kim, Hanau Yi, Jaehee Bae, Yumin Kim · Feb 26, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Critique Edit Coding

NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code.
All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol.

Open paper

Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning

Chi Ruan, Dongfu Jiang, Yubo Wang, Wenhu Chen · Sep 26, 2025

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 50% Moderate protocol signal Freshness: Cold Status: Fallback

Critique Edit Coding

We fine-tune multiple models (Critique-Coder) and evaluate them on different benchmarks to show their advantages over RL-only models.
We show that Critique-Coder consistently outperforms RL-only baselines on all the evaluated benchmarks.

Open paper

REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning

Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Jun Rao, Min Zhang · May 26, 2025

Citations: 0

Match reason: Matches selected tags (Coding, Critique Edit).

Score: 46% Sparse protocol signal Freshness: Cold Status: Fallback

Critique Edit Coding

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent