Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 80 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

Match reason: Matches selected tags (Long Horizon, General).

Score: 65% High protocol signal Freshness: Hot Status: Ready
Rubric Rating Automatic Metrics Long Horizon General
  • Multi-agent debate (MAD), and more broadly closed-system reasoning where agents iteratively transform each other's outputs, tends to preserve answer accuracy while degrading the reasoning behind those answers.
  • An R6 cohort study (Korean n=10x30 FEVER; English n=3x200 SciFact) finds inter-rater Fleiss kappa <= +0.018 with 0.8-1.4 Likert intra-rater shifts across language and domain -- the human agreement that faithfulness metrics have been…
Open paper
Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

Amin Karimi Monsefi, Dominic Culver, Nikhil Bhendawade, Manuel R. Ciosici, Yizhe Zhang, Irina Belousova · May 8, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result.
Open paper
SEQUOR: A Multi-Turn Benchmark for Realistic Constraint Following

Beatriz Canaverde, Duarte M. Alves, José Pombal, Giuseppe Attanasio, André F. T. Martins · May 7, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • Yet most instruction-following benchmarks focus on single-turn or short multi-turn scenarios, leaving open how well models handle long-horizon instruction-following tasks.
  • To bridge this gap, we present SEQUOR, an automatic benchmark for evaluating constraint adherence in long multi-turn conversations.
Open paper
Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with…
  • Extensive experiments on seven benchmark datasets show that LatentRAG achieves performance comparable to explicit agentic RAG methods while reducing inference latency by approximately 90%, substantially narrowing the latency gap with…
Open paper
Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents

Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri, Tara Khani · Apr 23, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 65% High protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • The transition from stateless language model inference to persistent, multi session autonomous agents has revealed memory to be a primary architectural bottleneck in the deployment of production grade agentic systems.
  • Through systematic benchmarking on the LongMemEval and LoCoMo evaluation suites, Memanto achieves state of the art accuracy scores of 89.8 percent and 87.1 percent respectively.
Open paper
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Red Team Automatic Metrics Long Horizon General
  • As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
  • To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety.
Open paper
Signals: Trajectory Sampling and Triage for Agentic Interactions

Shuguang Chen, Adil Hafeez, Salman Paracha · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Long Horizon General
  • We propose a lightweight, signal-based framework for triaging agentic interaction trajectories.
  • In a controlled annotation study on τ-bench, a widely used benchmark for tool-augmented agent evaluation, we show that signal-based sampling achieves an 82\% informativeness rate compared to 74\% for heuristic filtering and 54\% for random…
Open paper
Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing

Wenhao Yuan, Chenchen Lin, Jian Chen, Jinfeng Xu, Xuehe Wang, Edith Cheuk Han Ngai · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon General
  • In large language model (LLM) agents, reasoning trajectories are treated as reliable internal beliefs for guiding actions and updating memory.
  • In this paper, inspired by the vulnerability of unfaithful intermediate reasoning trajectories, we propose Self-Audited Verified Reasoning (SAVeR), a novel framework that enforces verification over internal belief states within the agent…
Open paper
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

Zhifei Xie, Zongzheng Hu, Fangda Ye, Xin Zhang, Haobo Chai, Zihang Liu · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon General
  • Prior work remains largely confined to laboratory settings, leaving a clear gap in real-world proactive agent: depth, complexity, ambiguity, precision and real-time constraints.
  • We first propose DD-MM-PAS (Demand Detection, Memory Modeling, Proactive Agent System) as a general paradigm for streaming proactive AI agent.
Open paper
MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents

Shu Wang, Edwin Yu, Oscar Love, Tom Zhang, Tom Wong, Steve Scargall · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon General
  • Large Language Model (LLM) agents require persistent memory to maintain personalization, factual continuity, and long-horizon reasoning, yet standard context-window and retrieval-augmented generation (RAG) pipelines degrade over…
  • Across benchmarks, MemMachine achieves strong accuracy-efficiency tradeoffs: on LoCoMo it reaches 0.9169 using gpt4.1-mini; on LongMemEvalS (ICLR 2025), a six-dimension ablation yields 93.0 percent accuracy, with retrieval-stage…
Open paper
Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Tool Use General
  • We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use.
  • Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains.
Open paper

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon General
  • Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation.
  • Benchmarks such as LOCOMO and LOCCO report performance degradation from 0.455 to 0.05 across stages, while MultiWOZ shows 78.2% accuracy with 6.8% false memory rate under persistent retention.
Open paper

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon General
  • Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset…
Open paper
$\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi, Sachin Patro · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon General
  • As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound.
  • We introduce YC-Bench, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns.
Open paper
Asymmetric Actor-Critic for Multi-turn LLM Agents

Shuli Jiang, Zhaoyang Zhang, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto · Mar 31, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon General
  • In many real-world applications, agents must succeed in one-shot settings where retries are impossible.
  • We propose an asymmetric actor-critic framework for reliable conversational agents.
Open paper
AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents

Zhaopeng Feng, Liangcai Su, Zhen Zhang, Xinyu Wang, Xiaotian Zhang, Xiaobin Wang · Mar 29, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon General
  • As large language models (LLMs) evolve into autonomous agents for long-horizon information-seeking, managing finite context capacity has become a critical bottleneck.
  • Building on this perspective, we propose AgentSwing, a state-aware adaptive parallel context management routing framework.
Open paper
PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian, Zuyan Liu · Mar 27, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, General).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon General
  • We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning.
  • The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.