Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 310 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,610) General (530) Long Horizon (319) Pairwise Preference (287) Coding (216) Simulation Env (186) Multi Agent (182) Medicine (115) Llm As Judge (106) Expert Verification (97) Human Eval (89) Rubric Rating (82) Web Browsing (79) Math (77) Demonstrations (67) Critique Edit (63)

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo

Ojas Jain, Dhruv Kumar · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env, Coding).

Score: 65% High protocol signal Freshness: Hot Status: Fallback

Simulation Env Multi Agent Coding

We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning…
We additionally contribute a fully functional 4-player Ludo simulator supporting Random, Heuristic, Game-Theory, and LLM agents.

Open paper

Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives

Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang, Jong C. Park · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready

Automatic MetricsSimulation Env Multi Agent General

Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision.
Our experiments demonstrate that the representative agent's accuracy consistently declines as social pressure increases: larger adversarial groups, more capable peers, and longer arguments all lead to significant performance degradation.

Open paper

QED-Nano: Teaching a Tiny Model to Prove Hard Theorems

LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching, Jia Li · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (Coding).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready

Rubric Rating Automatic Metrics MathCoding

To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.

Open paper

Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling

Qingyang Xu, Yaling Shen, Stephanie Fong, Zimu Wang, Yiwen Jiang, Xiangyu Zhao · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready

Red Team Simulation Env Medicine

The increasing use of large language models (LLMs) in mental healthcare raises safety concerns in high-stakes therapeutic interactions.
To address this gap, we introduce Personality-based Client Simulation Attack (PCSA), the first red-teaming framework that simulates clients in psychological counseling through coherent, persona-driven client dialogues to expose…

Open paper

ActionParty: Multi-Subject Action Binding in Generative Video Games

Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, Fabio Pizzati · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready

Automatic MetricsSimulation Env Multi Agent General

However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene.
We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments.

Open paper

ReDAct: Uncertainty-Aware Deferral for LLM Agents

Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov, Ilya Makarov · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 55% High protocol signal Freshness: Hot Status: Fallback

Simulation Env Long Horizon General

Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems.
In ReDAct, an agent is equipped with two LLMs: a small, cheap model used by default, and a large, more reliable but expensive model.

Open paper

The Detection-Extraction Gap: Models Know the Answer Before They Can Say It

Hanyang Wang, Mingxuan Zhu · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Coding).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Tool Use Coding

Across five model configurations, two families, and three benchmarks, we find that 52--88% of chain-of-thought tokens are produced after the answer is recoverable from a partial prefix.

Open paper

Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

Komal Kumar, Aman Chadha, Salman Khan, Fahad Shahbaz Khan, Hisham Cholakkal · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Coding).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Multi Agent Coding

Recent advances in multi-agent large language models (LLMs) have demonstrated strong potential for understanding user intent and are being trained to utilize various tools.
In this paper, we introduce Paper Circle, a multi-agent research discovery and analysis system designed to reduce the effort required to find, assess, organize, and understand academic literature.

Open paper

AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning

Yuanfu Sun, Kang Li, Dongzhe Fan, Jiajin Liu, Qiaoyu Tan · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Coding).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Tool Use Coding

To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology-aware navigation and LLM-based inference.
Specifically, we propose AgentGL, the first reinforcement learning (RL)-driven framework for AGL.

Open paper

SkillX: Automatically Constructing Skill Knowledge Bases for Agents

Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (Coding).

Score: 55% High protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Long Horizon Coding

Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient: agents learn in isolation, repeatedly rediscover similar behaviors from limited…
To address this problem, we propose SkillX, a fully automated framework for constructing a plug-and-play skill knowledge base that can be reused across agents and environments.

Open paper

SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks

Sunder Ali Khowaja, Kapal Dev, Engin Zeydan, Madhusanka Liyanage · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback

Automatic MetricsSimulation Env General

Open paper

From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion

Liang Zhu, Haolin Chen, Lidong Zhao, Xian Wu · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Coding).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Web Browsing Coding

Extensive evaluations across 1.5B--14B parameter models demonstrate that APC reduces expected editing costs from 19% to 50% while preserving standard HC performance.

Open paper

Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang, Yifei Hu · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 52% Moderate protocol signal Freshness: Hot Status: Fallback

Simulation Env Long Horizon General

However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior.
To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework.

Open paper

Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation

Philipp D. Siedler · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 52% Moderate protocol signal Freshness: Hot Status: Fallback

Simulation Env Multi Agent Law

We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation.
Agents are instantiated using nine interpretable traits organized into four archetypes, enabling systematic control over rhetorical style and strategic orientation.

Open paper

CCD-CBT: Multi-Agent Therapeutic Interaction for CBT Guided by Cognitive Conceptualization Diagram

Chang Liu, Changsheng Ma, Yongfeng Tao, Bin Hu, Minqiang Yang · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 52% Moderate protocol signal Freshness: Hot Status: Fallback

Simulation Env Multi Agent Medicine

However, existing methods often rely on static cognitive profiles and omniscient single-agent simulation, failing to capture the dynamic, information-asymmetric nature of real therapy.
We introduce CCD-CBT, a multi-agent framework that shifts CBT simulation along two axes: 1) from a static to a dynamically reconstructed Cognitive Conceptualization Diagram (CCD), updated by a dedicated Control Agent, and 2) from omniscient…

Open paper

From High-Dimensional Spaces to Verifiable ODD Coverage for Safety-Critical AI-based Systems

Thomas Stefani, Johann Maximilian Christensen, Elena Hoemann, Frank Köster, Sven Hallerbach · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 52% Moderate protocol signal Freshness: Hot Status: Fallback

Simulation Env Long Horizon Math

While Artificial Intelligence (AI) offers transformative potential for operational performance, its deployment in safety-critical domains such as aviation requires strict adherence to rigorous certification standards.
Ultimately, this method enables the validation of ODD coverage in higher dimensions, advancing a Safety-by-Design approach while complying with EASA's standards.

Open paper

Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

Xuanbo Su, Wenhao Hu, Haibo Su, Yunzhang Chen, Le Zhan, Yanqi Yang · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback

Human EvalSimulation Env General

We introduce SalesLLM benchmark, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with…
We propose a fully automatic evaluation pipeline that combines (i) an LLM-based rater for sales-process progress,and (ii) fine-tuned BERT classifiers for end-of-dialogue buying intent.

Open paper

From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection

Hongxu Zhou · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Coding).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback

Critique Edit Coding

While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy.
This observation helps explain why the agent achieves near-perfect superficial syntactic alignment yet fails to detect or resolve deeper semantic errors.

Open paper

Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis

Michael Cuccarese · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Coding).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback

Demonstrations Coding

This paper presents epistemic blinding in the context of an agentic system that uses large language models to reason across multiple biological datasets for drug target prioritization.
The complete target identification system is described - including LLM-guided evolutionary optimization of scoring functions and blinded agentic reasoning for target rationalization - with demonstration that both stages operate without…

Open paper

The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management

Andrew Ang, Nazym Azimbayev, Andrey Kim · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Coding).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback

Critique Edit Coding

Agentic AI shifts the investor's role from analytical execution to oversight.
We present an agentic strategic asset allocation pipeline in which approximately 50 specialized agents produce capital market assumptions, construct portfolios using over 20 competing methods, and critique and vote on each other's output.

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent