Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 25

When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou, Hanrong Zhang · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Simulation Env).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Tags: Critique Edit · Simulation Env · Long Horizon · Coding
  • As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirements or revising goals, during mid-task execution…
  • In this paper, we present the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes.
Open paper
PRBench: End-to-end Paper Reproduction in Physics Research

Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li · Mar 29, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Simulation Env).

Score: 65% · Moderate protocol signal · Freshness: Hot · Status: Ready
Tags: Rubric Rating · Expert Verification · Automatic Metrics · Simulation Env · Coding
  • We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics.
  • Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution.
Open paper
VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents

Yuhao Chen, Yi Xu, Xinyun Ding, Xiang Fang, Shuochen Liu, Luxi Lin · Mar 25, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Simulation Env).

Score: 65% · High protocol signal · Freshness: Hot · Status: Ready
Tags: Pairwise Preference · Simulation Env · Tool Use · Coding
  • With the growing demand for intelligent in-vehicle experiences, vehicle-based agents are evolving from simple assistants to long-term companions.
  • To address this gap, we introduce VehicleMemBench, a multi-user long-context memory benchmark built on an executable in-vehicle simulation environment.
Open paper

Match reason: Matches selected tags (Coding, Simulation Env).

Score: 65% · High protocol signal · Freshness: Hot · Status: Fallback
Tags: Simulation Env · Multi Agent · Coding
  • We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning…
  • We additionally contribute a fully functional 4-player Ludo simulator supporting Random, Heuristic, Game-Theory, and LLM agents (an agent-interface sketch follows this entry).
Open paper
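A minimal sketch of what a pluggable agent interface for such a simulator could look like, assuming a single `choose_move` entry point and a dict-based game state; these names are illustrative and do not come from the paper.

```python
import random
from typing import Protocol

class LudoAgent(Protocol):
    """Hypothetical interface a pluggable Ludo agent could implement."""
    def choose_move(self, state: dict, dice: int, legal_moves: list[int]) -> int:
        ...

class RandomAgent:
    """Baseline agent: picks uniformly among the legal moves."""
    def choose_move(self, state: dict, dice: int, legal_moves: list[int]) -> int:
        return random.choice(legal_moves)

def play_turn(agent: LudoAgent, state: dict, legal_moves: list[int]) -> int:
    """One stochastic turn: roll the dice, then delegate the decision."""
    dice = random.randint(1, 6)
    return agent.choose_move(state, dice, legal_moves)

move = play_turn(RandomAgent(), state={"pieces": [0, 0, 0, 0]}, legal_moves=[0, 1])
```

Heuristic, Game-Theory, and LLM agents would then differ only in how `choose_move` is implemented.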
Citations: 0

Match reason: Matches selected tags (Coding, Simulation Env).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Tags: Human Eval · Simulation Env · Long Horizon · Coding
  • Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence (a loss-attenuation sketch follows this entry).
  • Human evaluation further confirms that FrameRef's generated framings measurably affect human judgment.
Open paper
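The "framing-conditioned loss attenuation" above plausibly amounts to scaling each training example's loss by a framing-dependent weight. A minimal PyTorch sketch of that idea; the function name and weighting scheme are assumptions, not FrameRef's actual method.

```python
import torch
import torch.nn.functional as F

def framing_conditioned_loss(logits: torch.Tensor,
                             targets: torch.Tensor,
                             framing_weights: torch.Tensor) -> torch.Tensor:
    """Cross-entropy scaled per example by a framing-dependent weight.

    Attenuating the loss on framing-specific examples (weight < 1) nudges
    what the model learns there, while full-weight examples preserve
    overall task competence.
    """
    per_example = F.cross_entropy(logits, targets, reduction="none")
    return (framing_weights * per_example).mean()

logits = torch.randn(4, 10)                   # 4 examples, 10 classes
targets = torch.randint(0, 10, (4,))
weights = torch.tensor([1.0, 1.0, 0.2, 0.2])  # attenuate the last two examples
loss = framing_conditioned_loss(logits, targets, weights)
```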
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents

Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu, Liqun Liu · Feb 15, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Simulation Env).

Score: 58% · High protocol signal · Freshness: Warm · Status: Ready
Tags: Expert Verification · Simulation Env · Long Horizon · Coding
  • While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.
  • To address this gap, we propose AD-Bench, a benchmark designed based on real-world business requirements of advertising and marketing platforms.
Open paper
Citations: 0

Match reason: Matches selected tags (Coding, Simulation Env).

Score: 58% · High protocol signal · Freshness: Warm · Status: Ready
Tags: Red Team · Simulation Env · Coding
  • Large language model (LLM) safety evaluation is moving from content moderation to action security as modern systems gain persistent state, tool access, and autonomous control loops.
  • We present AJAR, a red-teaming framework that exposes multi-turn jailbreak algorithms as callable MCP services and lets an Auditor Agent orchestrate them inside a tool-aware runtime built on Petri.
Open paper
Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks

Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng, Bo An · Feb 26, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Simulation Env).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Tags: Simulation Env · Long Horizon · Coding
  • Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks.
  • To address this issue, we propose Hierarchy-of-Groups Policy Optimization (HGPO) for long-horizon agentic tasks (a group-advantage sketch follows this entry).
Open paper
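For context on the group-based RL mentioned above: GRPO-style methods sample a group of rollouts for the same prompt and standardize each reward against the group, avoiding a learned critic. A minimal sketch of that group-relative advantage; HGPO's hierarchical grouping itself is not reproduced here.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages for one group of rollouts of the same prompt.

    Each rollout's advantage is its reward standardized against the group
    mean and standard deviation, so no value network is required.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts scored by a binary task reward: roughly [1, -1, -1, 1].
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```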
Self-Correcting VLA: Online Action Refinement via Sparse World Imagination

Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li, Guoli Yang · Feb 25, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Simulation Env).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Tags: Simulation Env · Long Horizon · Coding
  • Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.
  • Evaluations on challenging robot manipulation tasks from simulation benchmarks and real-world settings demonstrate that SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher s…
Open paper
MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani, Babak Khalaj · Feb 18, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Simulation Env).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Tags: Simulation Env · Multi Agent · Coding
  • MALLVI is a Multi-Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation.
  • Rather than using a single model, MALLVI coordinates specialized agents (Decomposer, Localizer, Thinker, and Reflector) to manage perception, localization, reasoning, and high-level planning (a pipeline sketch follows this entry).
Open paper
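The coordination described above can be pictured as a staged loop over specialized agents sharing a common context. A sketch of that control flow; the stage behavior and the `done` flag are assumptions for illustration, not MALLVI's API.

```python
from typing import Callable

Stage = Callable[[dict], dict]  # each agent maps a shared context to an updated one

def closed_loop(stages: list[Stage], context: dict, max_iters: int = 3) -> dict:
    """Run the stages in order, feeding results back until the task is done."""
    for _ in range(max_iters):
        for stage in stages:
            context = stage(context)
        if context.get("done"):  # e.g. set by a Reflector-style final stage
            break
    return context

# Toy stages standing in for Decomposer, Localizer, Thinker, and Reflector.
pipeline = [
    lambda ctx: {**ctx, "subtasks": ["grasp cup", "place on shelf"]},
    lambda ctx: {**ctx, "target": (0.3, 0.1)},
    lambda ctx: {**ctx, "plan": ["move", "grasp", "move", "release"]},
    lambda ctx: {**ctx, "done": True},
]
result = closed_loop(pipeline, {"instruction": "put the cup on the shelf"})
```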
Citations: 0

Match reason: Matches selected tags (Coding, Simulation Env).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Tags: Simulation Env · Multi Agent · Coding
  • We present OR-Agent, a configurable multi-agent research framework designed for automated exploration in rich experimental environments.
  • OR-Agent organizes research as a structured tree-based workflow that explicitly models branching hypothesis generation and systematic backtracking, enabling controlled management of research trajectories beyond simple mutation-crossover loops (a tree-workflow sketch follows this entry).
Open paper
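A tree of hypothesis nodes with parent links is one natural way to realize the branching-and-backtracking workflow described above. A minimal sketch under that assumption; the class and method names are illustrative, not OR-Agent's.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class HypothesisNode:
    """One node in a tree-structured research workflow.

    Children are alternative follow-up hypotheses; the parent link makes
    systematic backtracking a constant-time step.
    """
    hypothesis: str
    score: Optional[float] = None  # filled in once an experiment runs
    parent: Optional["HypothesisNode"] = None
    children: list["HypothesisNode"] = field(default_factory=list)

    def branch(self, hypothesis: str) -> "HypothesisNode":
        child = HypothesisNode(hypothesis, parent=self)
        self.children.append(child)
        return child

root = HypothesisNode("baseline configuration")
attempt = root.branch("increase batch size")
attempt.score = 0.41
# Backtrack to the parent and explore a sibling branch instead.
sibling = attempt.parent.branch("switch optimizer")
```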
HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns

Xintao Wang, Jian Yang, Weiyuan Li, Rui Xie, Jen-tse Huang, Jun Gao · Jan 15, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Simulation Env).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Tags: Automatic Metrics · Simulation Env · Coding
  • We present HumanLLM, a framework treating psychological patterns as interacting causal forces.
  • Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment (r=0.90) while revealing that holistic metrics conflate simulation accuracy with social desirability.
Open paper
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary · Oct 5, 2025

Citations: 0

Match reason: Matches selected tags (Coding, Simulation Env).

Score: 53% · High protocol signal · Freshness: Cold · Status: Ready
Tags: Rubric Rating · Automatic Metrics · Simulation Env · Coding
  • We present a principled Bayesian evaluation framework that replaces Pass@k and average accuracy over N trials (avg@N) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and…
  • Together, these results recommend replacing Pass@k for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit (a Beta-Binomial sketch follows this entry).
Open paper
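As a concrete instance of the posterior described above: for binary outcomes, a Beta(α₀, β₀) prior updated with s successes out of N trials yields a Beta(α₀+s, β₀+N−s) posterior over the model's success probability. A minimal SciPy sketch; the paper's full protocol, including non-binary scores, is not reproduced here.

```python
from scipy import stats

def posterior_success(successes: int, trials: int,
                      alpha0: float = 1.0, beta0: float = 1.0):
    """Posterior mean and 95% credible interval for the success probability.

    Uses Beta-Binomial conjugacy: a Beta(alpha0, beta0) prior plus
    `successes` out of `trials` binary outcomes gives the posterior
    Beta(alpha0 + successes, beta0 + failures).
    """
    post = stats.beta(alpha0 + successes, beta0 + trials - successes)
    return post.mean(), post.interval(0.95)

mean, (lo, hi) = posterior_success(successes=37, trials=50)
print(f"p = {mean:.3f}, 95% credible interval [{lo:.3f}, {hi:.3f}]")
```

Reporting the interval makes the uncertainty from a finite number of trials explicit, which is the property the paper argues Pass@k lacks.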
On Discovering Algorithms for Adversarial Imitation Learning

Shashank Reddy Chirra, Jayden Teoh, Praveen Paruchuri, Pradeep Varakantham · Oct 1, 2025

Citations: 0

Match reason: Matches selected tags (Coding, Simulation Env).

Score: 50% · Moderate protocol signal · Freshness: Cold · Status: Ready
Tags: Demonstrations · Simulation Env · Coding
  • RA functions in adversarial imitation learning (AIL) are typically derived from divergence minimization objectives, relying heavily on human design and ingenuity.
  • Remarkably, DAIL generalises across unseen environments and policy optimization algorithms, outperforming current state-of-the-art human-designed baselines.
Open paper
World Simulation with Video Foundation Models for Physical AI

NVIDIA: Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji · Oct 28, 2025

Citations: 0

Match reason: Matches selected tags (Coding, Simulation Env).

Score: 50% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Tags: Simulation Env · Long Horizon · Coding · Multilingual
  • These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
  • To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and…
Open paper
