Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 23

Signals: Trajectory Sampling and Triage for Agentic Interactions

Shuguang Chen, Adil Hafeez, Salman Paracha · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Long Horizon General
  • We propose a lightweight, signal-based framework for triaging agentic interaction trajectories.
  • In a controlled annotation study on τ-bench, a widely used benchmark for tool-augmented agent evaluation, we show that signal-based sampling achieves an 82% informativeness rate compared to 74% for heuristic filtering and 54% for random…
Open paper

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Long Horizon General
  • Autonomous agents operating in continuous environments must decide not only what to do, but when to act.
  • High spread indicates a branching, uncertain future and drives the agent to act sooner; low spread signals predictability and permits longer rest intervals.
Open paper
LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation

Feiyu Duan, Xuanjing Huang, Zhongyu Wei · Mar 12, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Simulation Env Long Horizon General
  • However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states.
  • Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark for multi-scenario, long-horizon personalized assistance.
Open paper

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Simulation Env Long Horizon General
  • Large Language Models (LLMs) are increasingly used to power autonomous agents for complex, multi-step tasks.
  • We propose simulation-in-the-loop, an interaction paradigm that enables users and agents to explore simulated future trajectories before committing to decisions.
Open paper
Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR

Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni, Tanel Alumäe, Mathew Magimai Doss · Mar 6, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Long Horizon Multilingual
  • We present RAPTOR (Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14…
Open paper
Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Long Horizon General
  • Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
  • Extensive experiments on the LaMP-QA benchmark using three LLMs show that PR2 consistently outperforms strong baselines, achieving an average relative improvement of 8.8%-12% in personalized QA.
Open paper

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Long Horizon Multilingual
  • The methodological trajectory moves from classical supervised adaptation for task-specific demands to decoding-time alignment for safety, finally leveraging human feedback and preference modeling to achieve sociolinguistic acuity.
Open paper
Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

Jingyi Xu, Xingyu Ren, Zhoupeng Shou, Yumeng Zhang, Zhiqiang You · Jan 24, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Long Horizon Coding
  • To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent.
  • We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data.
Open paper
PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning

Bingxuan Li, Jeonghwan Kim, Cheng Qian, Xiusi Chen, Eitan Anzenberg, Niran Kundapur · Jan 17, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Long Horizon General
  • To enable a systematic study of this question, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution.
  • To address this gap, we propose PEARL, a reinforcement-learning framework that (i) augments the language agent with an external preference memory that stores and updates inferred strategies (e.g., attendee priorities, topic importance,…
Open paper
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang · Jan 14, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Simulation Env Long Horizon General
  • Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie
  • Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintainin
Open paper
Self-Debias: Self-correcting for Debiasing Large Language Models

Xuan Feng, Shuai Zhao, Luwei Xiao, Tianlong Gu, Bo An · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Long Horizon General
  • Unlike standard preference optimization which applies broad penalties, Self-Debias employs a fine-grained trajectory-level objective subject to dynamic debiasing constraints.
Open paper
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

Shiwan Zhao, Zhihu Wang, Xuyang Zhao, Jiaming Zhou, Caiyue Xu, Chenfei Liu · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Long Horizon General
  • Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines.
  • SFT may serve either support expansion or policy reshaping, whereas preference-based methods are usually off-policy reshaping.
Open paper

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Long Horizon General
  • Personalized large language models (PLLMs) have garnered significant attention for their ability to align outputs with individuals' needs and preferences.
  • Extensive evaluations on long-horizon benchmarks using the Qwen-3 model family (4B to 32B) validate the effectiveness of TSUBASA, surpassing competitive memory-augmented systems that rely primarily on memory writing, such as Mem0 and…
Open paper
TREX: Trajectory Explanations for Multi-Objective Reinforcement Learning

Dilina Rajapakse, Juan C. Rosero, Ivana Dusparic · Mar 23, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Long Horizon General
  • Multi-Objective Reinforcement Learning (MORL) addresses this limitation by enabling agents to optimize several objectives simultaneously, explicitly reasoning about trade-offs between them.
  • Current Explainable Reinforcement Learning (XRL) methods are typically designed for single scalar rewards and do not account for explanations with respect to distinct objectives or user preferences.
Open paper
AgenticRec: End-to-End Tool-Integrated Policy Optimization for Ranking-Oriented Recommender Agents

Tianyi Li, Zixuan Wang, Guidong Lei, Xiaodong Li, Hui Li · Mar 23, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Tool Use General
  • To address this, we present AgenticRec, a ranking-oriented agentic recommendation framework that optimizes the entire decision-making trajectory (including intermediate reasoning, tool invocation, and final ranking list generation) under…
  • Third, we introduce Progressive Preference Refinement (PPR) to resolve fine-grained preference ambiguities.
Open paper
HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning

Zhicong Lu, Zichuan Lin, Wei Jia, Changyuan Tian, Deheng Ye, Peiguang Li · Mar 19, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Long Horizon General
  • While large language models excel in diverse domains, their performance on complex long-horizon agentic decision-making tasks remains limited.
  • To emphasize significant segments in the trajectory, a hindsight model is devised to reflect the preference of performing a certain action after knowing the trajectory outcome.
Open paper
Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Long Horizon General
  • When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.
  • Hence, we introduce Hierarchical Reward Design from Language (HRDL): a problem formulation that extends classical reward design to encode richer behavioral specifications for hierarchical RL agents.
Open paper
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang · Feb 13, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Rubric Rating Long Horizon Medicine
  • MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities.
  • For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision…
Open paper
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright, Marcus Yearwood · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Rubric Rating Llm As Judge Simulation Env Long Horizon General
  • Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
  • We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations.
Open paper
Citations: 0

Match reason: Matches selected tags (Long Horizon, Pairwise Preference).

Score: 53% High protocol signal Freshness: Cold Status: Ready
Pairwise Preference Automatic Metrics Long Horizon General
  • Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation.
  • It integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation, forming an interpretable and controllable cognitive reasoning chain.
Open paper
