Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 218

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Simulation Env General
  • Web agents perceive web pages through an observation space, yet its granularity has remained an underexamined design choice.
  • Moreover, we propose PageDigest, a web-specific inference pipeline that delivers this region-level observation to the actor agent as a compact per-page digest that persists across steps.
Open paper
Measuring and Mitigating the Distributional Gap Between Real and Simulated User Behaviors

Shuhaib Mehri, Philippe Laban, Sumuk Shashidhar, Marwa Abdulhai, Sergey Levine, Michel Galley · May 8, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Simulation Env Coding
  • As user simulators are increasingly used for interactive training and evaluation of AI assistants, it is essential that they represent the diverse behaviors of real users.
  • In this work, we introduce a method to measure the distributional gap between real and simulated user behaviors, validated through a human study and ablations.
Open paper
CktFormalizer: Autoformalization of Natural Language into Circuit Representations

Jing Xiong, Qi Han, Chenchen Ding, He Xiao, Zunhai Su, Chaofan Tao · May 8, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready
Simulation Env Math
  • Lean serves three roles: (i) type checker: dependent types encode bit-width constraints, case coverage, and acyclicity, turning hardware defects into compile-time errors that guide iterative repair; (ii) correctness firewall: compiled designs…
Open paper

Match reason: Matches selected tags (Simulation Env).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready
Simulation Env Math
  • With a 7B-parameter LLM whose weights are entirely frozen, CIKA achieves 69.7% on the contamination-free Omni-MATH-Rule benchmark and 64.0% overall, compared to 60.5% for o1-mini, and 97.2% on GSM8K, 46-50% on AIME 2024-2026, and…
Open paper
Belief Memory: Agent Memory Under Partial Observability

Junfeng Liao, Qizhou Wang, Jianing Zhu, Bo Du, Rui Yan, Xiuying Chen · May 7, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready
Simulation Env General
  • LLM agents that operate over long context depend on external memory to accumulate knowledge over time.
  • By committing to one conclusion and discarding uncertainty, these methods introduce self-reinforcing error: the agent acts on the stored conclusion, never revisits alternatives, and reinforces the conclusion over time.
Open paper
LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios

Zeyi Li, Yushi Yang, Shawn Xie, Kyle Xu, Tianxing Chen, Yuran Wang · Apr 24, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready
Simulation Env General
  • Moreover, LeHome supports multiple robotic embodiments and emphasizes low-cost robots as a core focus, enabling end-to-end evaluation of household tasks on resource-constrained hardware.
Open paper
Semantic Error Correction and Decoding for Short Block Channel Codes

Jiafu Hao, Chentao Yue, Wanchun Liu, Yonghui Li, Branka Vucetic · Apr 24, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready
Simulation Env Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Simulation Env Coding
  • A model's ability to reliably process these sources is key to system safety.
  • Our findings reveal general patterns: most models rely more on document assertions than user assertions, and this preference is reinforced by post-training.
Open paper
DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain

Hsuvas Borkakoty, Sebastian Pohl, Cheng Wang, Bei Chen, Yufang Hou · May 8, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 58% Sparse protocol signal Freshness: Hot Status: Ready
Simulation Env General
  • LLM-based agents are increasingly deployed for routine but consequential tasks in real-world domains, where their behavior is governed by inherently ambiguous domain policies that admit multiple valid interpretations.
  • We introduce DRIP-R, a benchmark that systematically exploits real-world retail policy ambiguities to construct scenarios in which no single correct resolution exists.
Open paper
The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences

Hubert Plisiecki, Sabina Siudaj, Kacper Dudzic, Anna Sterna, Maciej Gorski, Karolina Drozdz · May 6, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 58% Sparse protocol signal Freshness: Hot Status: Ready
Simulation Env General
  • To test this hypothesis at the item level, we introduce the Pinocchio score (π_i), the ratio of inter-model response variance under neutral prompting to that under a human-simulation prompt, as an annotation-free measure of each item's…
Open paper
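
The Pinocchio score in the entry above is described as a ratio of two inter-model variances for a single item. A minimal sketch of that arithmetic follows, assuming numeric responses (for example Likert ratings) and population variance; the excerpt does not say which variance estimator or response scale the authors use, and the function name and data below are hypothetical.

```python
from statistics import pvariance


def pinocchio_score(neutral_responses, human_sim_responses):
    """Ratio of inter-model response variance for one item:
    variance under neutral prompting / variance under a human-simulation prompt.

    Each argument holds one numeric response per model for the same item.
    Population variance is an assumption; the source excerpt does not
    specify the estimator.
    """
    var_neutral = pvariance(neutral_responses)
    var_human_sim = pvariance(human_sim_responses)
    if var_human_sim == 0:
        # The ratio is undefined when all models agree under the
        # human-simulation prompt; how to handle this is left open here.
        raise ValueError("zero variance under the human-simulation prompt")
    return var_neutral / var_human_sim


# Hypothetical responses from four models to one questionnaire item (1-5 scale).
neutral = [1, 5, 2, 4]      # models disagree widely under neutral prompting
human_sim = [3, 4, 3, 4]    # models converge when asked to answer as a human

score = pinocchio_score(neutral, human_sim)
print(score)
```

Under this reading, a score well above 1 would mean the models diverge on the item mainly when they answer as themselves rather than as a simulated human.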
Associativity-Peakiness Metric for Contingency Tables

Naomi E. Zirkind, William J. Diehl · Apr 24, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 58% Sparse protocol signal Freshness: Hot Status: Ready
Simulation Env General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Milestone-Guided Policy Learning for Long-Horizon Language Agents

Zixuan Wang, Yuchen Yan, Hongxing Li, Teng Pan, Dingming Li, Ruiqing Zhang · May 7, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 65% High protocol signal Freshness: Hot Status: Fallback
Simulation Env Long Horizon Coding
  • While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging.
  • BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local…
Open paper
Democratizing Tool Learning with Environments Fully Simulated by a Free 8B Language Model

Chenming Tang, Hsiu-Yuan Huang, Weijie Liu, Junqiang Zheng, Saiyong Yang, Yunfang Wu · Apr 20, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
Simulation Env Long Horizon General
  • Reinforcement learning (RL) has become a prevalent paradigm for training tool calling agents, which typically requires online interactive environments.
  • In this work, we propose TRUSTEE, a cost-friendly method for training tool calling agents with dynamic environments fully simulated by free open-source LMs that can be as small as 8B, including task generation, user simulation, tool…
Open paper
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu · Apr 24, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 62% Moderate protocol signal Freshness: Hot Status: Fallback
Simulation Env Long Horizon Law
  • Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities.
  • Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific…
Open paper
Time series causal discovery with variable lags

Bruno Petrungaro, Anthony C. Constantinou · Apr 15, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 52% Sparse protocol signal Freshness: Warm Status: Ready
Simulation Env General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering

Weikang Zhang, Zimo Zhu, Zhichuan Yang, Chen Huang, Wenqiang Lei, See-Kiong Ng · Apr 14, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 52% Sparse protocol signal Freshness: Warm Status: Ready
Simulation Env Medicine
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

Xuning Yang, Rishit Dagli, Alex Zook, Hugo Hadfield, Ankit Goyal, Stan Birchfield · Apr 10, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 52% Sparse protocol signal Freshness: Warm Status: Ready
Simulation Env General
  • We introduce RoboLab, a simulation benchmarking framework designed to address these challenges.
  • We introduce an accompanying RoboLab-120 benchmark, consisting of 120 tasks categorized into three competency axes: visual, procedural, relational, across three difficulty levels.
Open paper
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng · Apr 13, 2026

Citations: 0

Match reason: Matches selected tags (Simulation Env).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Simulation Env Multi Agent General
  • We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
  • We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than…
Open paper
