Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 31 · Search mode: keyword

Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs

Yuxuan Lu, Ziyi Wang, Yingzhou Lu, Yisi Sang, Jiri Gesi, Xianfeng Tang · May 17, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 65% High protocol signal Freshness: Hot Status: Ready
Pairwise Preference Simulation Env Long Horizon General
  • Training tool-calling agents requires large-scale trajectory data with verifiable labels, yet existing approaches either synthesize environments that diverge from real API behavior or generate tasks without ground-truth outcomes for…
  • To address environment drift in live APIs, we construct a retrieval-augmented simulator that caches all exploration results and replays them during training and evaluation, enabling fully offline and reproducible RL.
Open paper
Democratizing Tool Learning with Environments Fully Simulated by a Free 8B Language Model

Chenming Tang, Hsiu-Yuan Huang, Weijie Liu, Junqiang Zheng, Saiyong Yang, Yunfang Wu · Apr 20, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
Simulation Env Long Horizon General
  • Reinforcement learning (RL) has become a prevalent paradigm for training tool calling agents, which typically requires online interactive environments.
  • In this work, we propose TRUSTEE, a cost-friendly method for training tool calling agents with dynamic environments fully simulated by free open-source LMs that can be as small as 8B, including task generation, user simulation, tool…
Open paper
MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation

Taolin Han, Shuang Wu, Jinghang Wang, Yuhao Zhou, Renquan Lv, Bing Zhao · Mar 26, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Simulation Env Long Horizon General
  • Current scientific evaluation benchmarks predominantly rely on static, single-turn Question Answering (QA) formats, which are inadequate for measuring model performance in complex scientific tasks that require multi-step iteration and…
  • To address this gap, we introduce MolQuest, a novel agent-based evaluation framework for molecular structure elucidation built upon authentic chemical experimental data.
Open paper

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Demonstrations Human Eval Llm As Judge Long Horizon General
  • LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
  • We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation.
Open paper

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Automatic Metrics Simulation Env Long Horizon General
  • We introduce DECEPTGUARD, a unified framework that systematically compares three monitoring regimes: black-box monitors (actions and outputs only), CoT-aware monitors (additionally observing the agent's chain-of-thought reasoning trace),…
  • We introduce DECEPTSYNTH, a scalable synthetic pipeline for generating deception-positive and deception-negative agent trajectories across a novel 12-category taxonomy spanning verbal, behavioral, and structural deception.
Open paper
LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation

Feiyu Duan, Xuanjing Huang, Zhongyu Wei · Mar 12, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Simulation Env Long Horizon General
  • However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states.
  • Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark for multi-scenario, long-horizon personalized assistance.
Open paper

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Simulation Env Long Horizon General
  • Large Language Models (LLMs) are increasingly used to power autonomous agents for complex, multi-step tasks.
  • We propose simulation-in-the-loop, an interaction paradigm that enables users and agents to explore simulated future trajectories before committing to decisions.
Open paper
ReDAct: Uncertainty-Aware Deferral for LLM Agents

Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov, Ilya Makarov · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Simulation Env Long Horizon General
  • Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems.
  • In ReDAct, an agent is equipped with two LLMs: a small, cheap model used by default, and a large, more reliable but expensive model.
Open paper
Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications

Che Chen, Lanhua Li, Shimin Gong, Yu Zhao, Yuming Fang, Dusit Niyato · Mar 23, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Simulation Env Long Horizon General
  • To maximize the overall throughput, we first propose a delay-tolerant multi-agent deep reinforcement learning (MADRL) algorithm that integrates a delay-penalized reward to encourage information sharing among UAVs, while jointly optimizing…
Open paper
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright, Marcus Yearwood · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Rubric Rating Llm As Judge Simulation Env Long Horizon General
  • Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
  • We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations.
Open paper
Contextualized Privacy Defense for LLM Agents

Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie, Diyi Yang · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Simulation Env Long Horizon General
  • LLM agents increasingly act on users' personal information, yet existing privacy defenses remain limited in both design and adaptability.
  • These paradigms are insufficient for supporting contextual, proactive privacy decisions in multi-step agent execution.
Open paper
AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Cheng Jiayang, Dongyu Ru, Lin Qiu, Yiyang Li, Xuezhi Cao, Yangqiu Song · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Simulation Env Long Horizon General
  • Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory.
  • To address these gaps, we introduce AMemGym, an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization.
Open paper
ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li, Kaiqiao Han · Feb 25, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Simulation Env Long Horizon General
  • Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks.
  • Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL.
Open paper
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding, Gedas Bertasius · Feb 25, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Simulation Env Long Horizon General
  • We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.
  • Furthermore, real-world evaluations across 8 long-horizon tasks demonstrate an average success rate of 85%.
Open paper
SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards

Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei · Feb 24, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Simulation Env Long Horizon General
  • Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning.
  • We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design.
Open paper
Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback
Simulation Env Long Horizon General
  • However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior.
  • To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework.
Open paper
Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

Yinan Zheng, Tianyi Tan, Bin Huang, Enguang Liu, Ruiming Liang, Jianlin Zhang · Feb 26, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback
Simulation Env Long Horizon General
  • However, their applications and evaluations in autonomous driving remain limited to simulation-based or laboratory settings.
  • Moreover, we also provide an effective reinforcement learning post-training strategy to further enhance the safety of the learned planner.
Open paper
