Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 213 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

Alejandra Zambrano, Sara Vera Marjanovic, Imene Kerboua, Xing Han Lù, Leila Kosseim · May 28, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Simulation Env General
  • To address this, we introduce PlanAhead, a static planner-executor framework that evaluates the impact of plan representation in agent performance.
  • To account for stochastic variability, we introduce two novel evaluation metrics: Achievement Rate (AR) and Solved-Task Consistency (STC).
Open paper
Revisiting Observation Reduction for Web Agents: Comprehensive Evaluation with a Lightweight Framework

Masafumi Enomoto, Ryoma Obara, Haochen Zhang, Masafumi Oyamada · May 28, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • HTML observations in LLM-based web agents are extremely long, and while many reduction methods have been proposed, it remains unclear which methods reduce overall agent latency while maintaining performance.
  • To address this, we propose a lightweight evaluation framework based on the Minimal Failure Set (MFS), the minimal set of HTML elements whose removal causes task failure.
Open paper
Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

Zhengyang Tang, Yi Zhang, Chenxin Li, Xin Lai, Pengyuan Lyu, Yiduo Guo · May 8, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • When a phone-use agent avoids harm, does that show safety, or simply inability to act?
  • We evaluate eight representative phone-use agents under this framework.
Open paper
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Asaf Yehudai, Lilach Eden, Michal Shmueli-Scheuer · May 21, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework.
  • In our experiments on four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback.
Open paper
Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Simulation Env General
  • Web agents perceive web pages through an observation space, yet its granularity has remained an underexamined design choice.
  • Moreover, we propose PageDigest, a web-specific inference pipeline that delivers this region-level observation to the actor agent as a compact per-page digest that persists across steps.
Open paper
Milestone-Guided Policy Learning for Long-Horizon Language Agents

Zixuan Wang, Yuchen Yan, Hongxing Li, Teng Pan, Dingming Li, Ruiqing Zhang · May 7, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Fallback
Simulation Env Long Horizon Coding
  • While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging.
  • BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local…
Open paper
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu · May 28, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 68% Moderate protocol signal Freshness: Hot Status: Ready
Demonstrations Simulation Env Long Horizon General
  • Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data,…
  • Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot…
Open paper
Orchard: An Open-Source Agentic Modeling Framework

Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang · May 14, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 68% High protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Tool Use LawCoding
  • Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments.
  • We present Orchard, an open-source framework for scalable agentic modeling.
Open paper
Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 57% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Although advances such as chain-of-thought, tree-of-thought or reinforcement learning have improved the performance of LLMs in reasoning and planning tasks, they are still brittle and have not achieved human-level performance in many…
  • This procedure is tested and compared using several benchmarks in language-based planning and general reasoning.
Open paper

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 57% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Multilingual
  • This paper presents the first benchmark for the task of automatic part-of-speech (POS) tagging for the Tajik language.
  • Zero-shot evaluation revealed the greatest typological proximity of Tajik to Persian (ParsBERT) and Russian (ruBERT).
Open paper
From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification

Md Erfan, Md Kamal Hossain Chowdhury, Ahmed Ryan, Md Rayhanur Rahman · Apr 24, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 57% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics MathCoding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows

Xinglin Wang, Zishen Liu, Shaoxiong Feng, Peiwen Yuan, Yiwei Li, Jiayi Shi · May 7, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 54% Sparse protocol signal Freshness: Warm Status: Ready
Math
  • Agentic systems increasingly solve complex user requests by executing orchestrated workflows, where subtasks are assigned to specialized models or tools and coordinated according to their dependencies.
  • While recent work improves agent efficiency by optimizing the performance--cost--latency frontier, real deployments often impose concrete requirements: a workflow must be completed within a specified budget and before a specified deadline.
Open paper
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Qijun Han, Haoqin Tu, Zijun Wang, Haoyue Dai, Yiyang Zhou, Nancy Lau · Apr 23, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 54% Sparse protocol signal Freshness: Warm Status: Ready
Coding
  • We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search.
  • We evaluate VLAA-GUI across five top-tier backbones, including Opus 4.5, 4.6 and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena).
Open paper
From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

Zhengxu Yu, Yu Fu, Zhiyuan He, Yuxuan Huang, Lee Ka Yiu, Meng Fang · Apr 24, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Multi Agent General
  • Individual agent capabilities have advanced rapidly through modular skills and tool integrations, yet multi-agent systems remain constrained by fixed team structures, tightly coupled coordination logic, and session-bound learning.
  • To fill this gap, we introduce OneManCompany (OMC), a framework that elevates multi-agent systems to the organisational level.
Open paper
Ask Now, Use Later: Benchmarking the Proactivity Gap in Long-Lived LLM Agents

Bin Wu, Guanyun Zou, Bingbing Wang, Huan Zhao, Chuan Shi · May 27, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% High protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics Law
  • A long-lived LLM agent, such as OpenClaw, earns its value by acting on a user's preferences and constraints across sessions, not just the current request.
  • Yet today's agents keep what a user volunteers but rarely ask for what stays unspoken, leaving a proactivity gap in long-lived LLM agents: an agent cannot act on a preference it never obtained.
Open paper
Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency

Andrew Ivan Soegeng, Patrick Sutanto, Tan Sang Nguyen · May 21, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Sparse protocol signal Freshness: Hot Status: Fallback
Critique Edit Multilingual
  • Evaluations on the BLEnD benchmark demonstrate that our approach significantly improves cultural alignment-boosting performance on English queries by an average of 5.03%-relying entirely on self-generated data.
Open paper
FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting

Marco Obermeier, Marco Pruckner, Florian Haselbeck, Andreas Zeiselmair · Apr 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • We address this gap by presenting the Foundation Models in Energy Time Series Forecasting (FETS) benchmark.
  • We (1) provide a structured overview of energy forecasting use cases along three main dimensions: stakeholders, attributes, and data categories; (2) collect and analyze 54 datasets across 9 data categories, guided by typical stakeholder…
Open paper
Revisiting Graph-Tokenizing Large Language Models: A Systematic Evaluation of Graph Token Understanding

Zhongjian Zhang, Yue Yu, Mengmei Zhang, Junping Du, Xiao Wang, Chuan Shi · May 5, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Motivated by this question, we formalize a unified framework for GTokenLLMs and propose an evaluation pipeline, GTEval, to assess graph-token understanding via instruction transformations at the format and content levels.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.