Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 55

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Ready
Demonstrations Human Eval LLM As Judge Long Horizon General
  • LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
  • We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation.
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Long Horizon General
  • Personalized large language models (PLLMs) have garnered significant attention for their ability to align outputs with individuals' needs and preferences.
  • Extensive evaluations on long-horizon benchmarks using the Qwen-3 model family (4B to 32B) validate the effectiveness of TSUBASA, surpassing competitive memory-augmented systems that rely primarily on memory writing, such as Mem0 and…
Open paper
Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Tool Use General
  • We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use.
  • Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains.
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon Math Coding
  • Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval.
  • Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism.
Open paper
Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with…
  • Extensive experiments on seven benchmark datasets show that LatentRAG achieves performance comparable to explicit agentic RAG methods while reducing inference latency by approximately 90%, substantially narrowing the latency gap with…
Open paper
Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents

Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri, Tara Khani · Apr 23, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 65% High protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • The transition from stateless language model inference to persistent, multi-session autonomous agents has revealed memory to be a primary architectural bottleneck in the deployment of production-grade agentic systems.
  • Through systematic benchmarking on the LongMemEval and LoCoMo evaluation suites, Memanto achieves state-of-the-art accuracy of 89.8% and 87.1%, respectively.
Open paper
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu · Apr 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 62% Moderate protocol signal Freshness: Hot Status: Fallback
Simulation Env Long Horizon Law
  • Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities.
  • Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific…
Open paper
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 58% High protocol signal Freshness: Warm Status: Ready
Red Team Automatic Metrics Long Horizon General
  • As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
  • To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety.
Open paper
Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design

Bin Zhu, Qianghuai Jia, Tian Lan, Junyang Ren, Feng Gu, Feihu Jiang · Mar 30, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Long Horizon General
  • Deep research agents autonomously conduct open-ended investigations, integrating complex information retrieval with multi-step reasoning across diverse sources to solve real-world problems.
  • To address this, we present Marco DeepResearch, a deep research agent optimized with a verification-centric framework design at three levels: (1) QA Data Synthesis: We introduce verification mechanisms to graph-based and agent-based QA…
Open paper
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai, Zixian Huang · Mar 30, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Critique Edit Long Horizon General
  • We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe.
  • On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness,…
Open paper

Match reason: Matched by broad semantic/index fallback.

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Long Horizon Law
  • AI agents, as autonomous digital actors, need agent-native protocols; existing approaches such as GUI automation and MCP-based skills suffer from high token consumption, fragmented interaction, and inadequate security, due to lacking a unified…
  • To address these issues, we present ANX, an open, extensible, verifiable agent-native protocol and top-level framework integrating CLI, Skill, and MCP, resolving pain points via protocol innovation, architectural optimization and tool…
Open paper
WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

Fanheng Kong, Jingyuan Zhang, Yang Yue, Chenxi Sun, Yang Tian, Shi Feng · Mar 26, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Long Horizon Coding
  • To address these gaps, we introduce WebTestBench, a benchmark for evaluating end-to-end automated web testing.
  • These findings expose a substantial gap between current computer-use agent capabilities and industrial-grade deployment demands.
Open paper
SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

Gabriel Orlanski, Devjeet Roy, Alexander Yun, Changho Shin, Alex Gu, Albert Ge · Mar 25, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Long Horizon Coding
  • We introduce SlopCodeBench, a language-agnostic benchmark comprising 20 problems and 93 checkpoints, in which agents repeatedly extend their own prior solutions under evolving specifications that force architectural decisions without…
  • No agent solves any problem end-to-end across 11 models; the highest checkpoint solve rate is 17.2%.
Open paper
ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation

Jungwoo Oh, Hyunseung Chung, Junhee Lee, Min-Gyu Kim, Hangyul Yoon, Ki Seong Lee · Mar 15, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 52% Sparse protocol signal Freshness: Warm Status: Ready
Long Horizon Law Medicine
  • To investigate this, we introduce ECG-Reasoning-Benchmark, a novel multi-turn evaluation framework comprising over 6,400 samples to systematically assess step-by-step reasoning across 17 core ECG diagnoses.
  • Our comprehensive evaluation of state-of-the-art models reveals a critical failure in executing multi-step logical deduction.
Open paper
YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi, Sachin Patro · Apr 1, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon General
  • As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound.
  • We introduce YC-Bench, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns.
Open paper
Asymmetric Actor-Critic for Multi-turn LLM Agents

Shuli Jiang, Zhaoyang Zhang, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto · Mar 31, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon General
  • In many real-world applications, agents must succeed in one-shot settings where retries are impossible.
  • We propose an asymmetric actor-critic framework for reliable conversational agents.
Open paper
