Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 195 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,634) General (532) Long Horizon (320) Pairwise Preference (289) Coding (221) Simulation Env (190) Multi Agent (184) Medicine (117) Llm As Judge (109) Expert Verification (98) Human Eval (89) Rubric Rating (82) Web Browsing (79) Math (78) Demonstrations (67) Critique Edit (63)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Apr 16, 2026 · Citations: 0

We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Latent-Condensed Transformer for Efficient Long Context Modeling
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models
Apr 13, 2026 · Citations: 0

Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
Apr 13, 2026 · Citations: 0

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
Apr 13, 2026 · Citations: 0

Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies.
Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation
Apr 13, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Apr 13, 2026 · Citations: 0

We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry
Apr 11, 2026 · Citations: 0

To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human-written poems and 20,388 poems are generated by four popular LLMs.
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
Apr 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
Apr 11, 2026 · Citations: 0

To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression--ranking objective to provide fine-grained evaluation of reasoning paths.
MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits
Apr 10, 2026 · Citations: 0

On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

Evolutionary System Prompt Learning for Reinforcement Learning in LLMs

Lunjun Zhang, Ryan Chen, Bradly C. Stadie · Feb 16, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

Building agentic systems that can autonomously self-improve from experience is a longstanding goal of AI.
E-SPL encourages a natural division between declarative knowledge encoded in prompts and procedural knowledge encoded in weights, resulting in improved performance across reasoning and agentic tasks.

Open paper

Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

Jingyi Xu, Xingyu Ren, Zhoupeng Shou, Yumeng Zhang, Zhiqiang You · Jan 24, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Ready

Pairwise Preference Automatic Metrics Long Horizon Coding

To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent.
We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data.

Open paper

Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives

Ruchira Dhar, Qiwei Peng, Anders Søgaard · Feb 14, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

Consequently, we highlight the importance of contrastive evaluation for obtaining a more complete understanding of model capabilities.

Open paper

Text-only adaptation in LLM-based ASR through text denoising

Andrés Carofilis, Sergio Burdisso, Esaú Villatoro-Tello, Shashi Kumar, Kadri Hacioglu, Srikanth Madikeri · Jan 28, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.

Open paper

Rethinking Discrete Speech Representation Tokens for Accent Generation

Jinzuomu Zhong, Yi Wang, Korin Richmond, Peter Bell · Jan 27, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

We propose a unified evaluation framework that measures both accessibility of accent information via a novel Accent ABX task and recoverability via cross-accent Voice Conversion (VC) resynthesis.

Open paper

Multi-Task Instruction Tuning via Data Scheduling for Low-Resource Arabic AudioLLMs

Hunzalah Hassan Bhatti, Firoj Alam, Shammur Absar Chowdhury · Jan 18, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

To support end-to-end Arabic speech summarization, we introduce AraMega-SSum, a first speech summarization resource for training and benchmarking Arabic-centric Audio-LLMs.

Open paper

Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind

Minyuan Ruan, Ziyue Wang, Kaiming Liu, Yunghwei Lai, Peng Li, Yang Liu · Feb 14, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Fallback

Automatic Metrics Long Horizon General

Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks to assist human users.
Resolving this epistemic divergence requires Theory of Mind (ToM), yet existing ToM evaluations for LLMs primarily focus on isolated belief inference, overlooking its functional utility in real-world interaction.

Open paper

Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation

Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson · Feb 11, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Fallback

Automatic Metrics Tool Use Coding

We present Agent-Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world productivity software API tasks via code execution.
In addition, we evaluate the robustness of the framework with ablation experiments to assess the contribution of access to API documentation on benchmark performance.

Open paper

LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning

Xinwu Ye, Yicheng Mao, Jia Zhang, Yimeng Liu, Li Hao, Fang Wu · Feb 6, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Fallback

Automatic Metrics Long Horizon General

Across diverse chemical reasoning benchmarks, LatentChem achieves a 59.88\% non-tie win rate over strong CoT-based baselines on ChemCoTBench, while delivering a 10.84\times average reduction in reasoning overhead.

Open paper

GPT-4o Lacks Core Features of Theory of Mind

John Muchovej, Amanda Royka, Shane Lee, Julian Jara-Ettinger · Feb 12, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 73% Sparse protocol signal Freshness: Warm Status: Ready

General

Research into this question has focused on evaluating LLMs against benchmarks and found success across a range of social tasks.
However, these evaluations do not test for the actual representations posited by ToM: namely, a causal model of mental states and behavior.

Open paper

Contextual Drag: How Errors in the Context Affect LLM Reasoning

Yun Cheng, Xingyu Zhu, Haoyu Zhao, Sanjeev Arora · Feb 4, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 73% Sparse protocol signal Freshness: Warm Status: Ready

General

Across evaluations of 11 proprietary and open-weight models on 8 reasoning tasks, contextual drag induces 10-20% performance drops, and iterative self-refinement in models with severe contextual drag can collapse into self-deterioration.

Open paper

Discovering Implicit Large Language Model Alignment Objectives

Edward Chen, Sanmi Koyejo, Carlos Guestrin · Feb 17, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 57% Moderate protocol signal Freshness: Warm Status: Ready

Rubric Rating Human Eval General

To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives.
Experiments with popular open-source reward models show that the framework consistently captures > 90% of reward behavior, a finding further corroborated by human evaluation.

Open paper

Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought

Yuling Jiao, Yanming Lai, Huazhen Lin, Wensen Ma, Houduo Qi, Defeng Sun · Feb 16, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 54% Sparse protocol signal Freshness: Warm Status: Ready

Long Horizon General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations

William Lugoloobi, Thomas Foster, William Bankes, Chris Russell · Feb 10, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 54% Sparse protocol signal Freshness: Warm Status: Ready

MathCoding

Using E2H-AMC, which provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended…
Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best-performing model whilst reducing inference cost by up to 70\% on MATH, showing that internal representations enable practical…

Open paper

GraphSeek: Next-Generation Graph Analytics with LLMs

Maciej Besta, Łukasz Jarmocik, Orest Hrycyna, Shachar Klaiman, Konrad Mączka, Robert Gerstenberger · Feb 11, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 51% Sparse protocol signal Freshness: Warm Status: Ready

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Accelerating Scientific Research with Gemini: Case Studies and Common Techniques

David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, MohammadHossein Bateni · Feb 3, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 51% Sparse protocol signal Freshness: Warm Status: Ready

MathLaw

Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer.

Open paper

Reward Modeling from Natural Language Human Feedback

Zongqi Wang, Rui Wang, Yuchuan Wu, Yiyao Yu, Pinyi Zhang, Shaoning Sun · Jan 12, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 54% Sparse protocol signal Freshness: Warm Status: Fallback

Pairwise PreferenceCritique Edit General

To address this issue, we propose Reward Modeling from Natural Language Human Feedback (RM-NLHF), which leverages natural language feedback to obtain process reward signals, thereby mitigating the problem of limited solution space inherent…
Additionally, considering that human critiques are difficult to scale up, we introduce Meta Reward Model (MetaRM) which learns to predict process reward from datasets with human critiques and then generalizes to data without human…

Open paper

Rewards as Labels: Revisiting RLVR from a Classification Perspective

Zepeng Zhai, Meilin Chen, Jiaxuan Zhao, Junlang Qian, Lei Shen, Yuan Lu · Feb 5, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Math

Extensive experiments on mathematical reasoning benchmarks show that REAL improves training stability and consistently outperforms GRPO and strong variants such as DAPO.

Open paper

$V_0$: A Generalist Value Model for Any Policy at State Zero

Yi-Kai Zhang, Zhiyuan Yao, Hongyan Hao, Yueqing Sun, Qi Gu, Hui Su · Feb 3, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 32% Sparse protocol signal Freshness: Warm Status: Ready

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

Shir Ashury-Tahan, Yifan Mai, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen · Jan 22, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready

Coding

Large Language Models (LLM) benchmarks tell us when models fail, but not why they fail.
Without disentangling such causes, benchmarks remain incomplete and cannot reliably guide model improvement.

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent