Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 2,736 · Search mode: keyword

Filter by tag

All · Automatic Metrics (1,456) · General (477) · Long Horizon (277) · Pairwise Preference (256) · Coding (200) · Simulation Env (165) · Multi Agent (158) · Medicine (107) · LLM-as-Judge (94) · Expert Verification (91) · Human Eval (76) · Rubric Rating (75) · Web Browsing (73) · Math (69) · Demonstrations (63) · Critique Edit (61)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

ActionParty: Multi-Subject Action Binding in Generative Video Games
Apr 2, 2026 · Citations: 0

However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene.

Steerable Visual Representations
Apr 2, 2026 · Citations: 0

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
Apr 2, 2026 · Citations: 0

VOID: Video Object and Interaction Deletion
Apr 2, 2026 · Citations: 0

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation
Apr 2, 2026 · Citations: 0

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
Apr 2, 2026 · Citations: 0

Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO.

Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency
Apr 2, 2026 · Citations: 0

Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation.

The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management
Apr 2, 2026 · Citations: 0

Agentic AI shifts the investor's role from analytical execution to oversight.

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Apr 2, 2026 · Citations: 0

A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.

LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications
Apr 2, 2026 · Citations: 0

Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset…

From High-Dimensional Spaces to Verifiable ODD Coverage for Safety-Critical AI-based Systems
Apr 2, 2026 · Citations: 0

While Artificial Intelligence (AI) offers transformative potential for operational performance, its deployment in safety-critical domains such as aviation requires strict adherence to rigorous certification standards.

Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
Apr 2, 2026 · Citations: 0

Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages instead of running a generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines, then compare them against human-eval references.

ActionParty: Multi-Subject Action Binding in Generative Video Games

Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, Fabio Pizzati · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · Simulation Env · Multi Agent · General

However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene.
We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments.

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Ready

Pairwise Preference · LLM-as-Judge · Automatic Metrics · Medicine · Multilingual

A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%).

Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu, Xinyu Dai · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Ready

Pairwise Preference · Automatic Metrics · General

However, current reranking models are typically optimized on static human-annotated relevance labels in isolation, decoupled from the downstream generation process.
To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality.

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Fallback

Automatic Metrics · Long Horizon · Law

Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO.
It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute…

Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency

Payal Fofadiya, Sunil Tiwari · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Fallback

Automatic Metrics · Long Horizon · General

Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation.
Benchmarks such as LOCOMO and LOCCO report performance degradation from 0.455 to 0.05 across stages, while MultiWOZ shows 78.2% accuracy with 6.8% false memory rate under persistent retention.

LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications

Mayank Mayank, Bharanidhar Duraisamy, Florian Geiss · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Fallback

Automatic Metrics · Long Horizon · General

Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset…

Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents

Xuan Qi · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 45% · High protocol signal · Freshness: Hot · Status: Fallback

Automatic Metrics · Tool Use · General

Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood.
We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0–512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark.

SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks

Sunder Ali Khowaja, Kapal Dev, Engin Zeydan, Madhusanka Liyanage · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Fallback

Automatic Metrics · Simulation Env · General

LLM-as-a-Judge for Time Series Explanations

Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Fallback

LLM-as-Judge · Automatic Metrics · General

Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference-based similarity metrics and consistency checking models require ground truth explanations, while traditional…
To support this, we construct a synthetic benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations.

Diff-KD: Diffusion-based Knowledge Distillation for Collaborative Perception under Corruptions

Pengcheng Lyu, Chaokun Zhang, Gong Chen, Tao Tang, Zhaoxiang Luo · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 45% · High protocol signal · Freshness: Hot · Status: Fallback

Automatic Metrics · Multi Agent · General

Multi-agent collaborative perception enables autonomous systems to overcome individual sensing limits through collective intelligence.

From High-Dimensional Spaces to Verifiable ODD Coverage for Safety-Critical AI-based Systems

Thomas Stefani, Johann Maximilian Christensen, Elena Hoemann, Frank Köster, Sven Hallerbach · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 42% · Moderate protocol signal · Freshness: Hot · Status: Fallback

Simulation Env · Long Horizon · Math

While Artificial Intelligence (AI) offers transformative potential for operational performance, its deployment in safety-critical domains such as aviation requires strict adherence to rigorous certification standards.
Ultimately, this method enables the validation of ODD coverage in higher dimensions, advancing a Safety-by-Design approach while complying with EASA's standards.

Steerable Visual Representations

Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

Sarath Shekkizhar, Romain Cosentino, Adam Earle · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

VOID: Video Object and Interaction Deletion

Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, Ta-Ying Cheng · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Chongjie Ye, Cheng Cao, Chuanyu Pan, Yiming Hao, Yihao Zhi, Yuanming Hu · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

Optimizing Interventions for Agent-Based Infectious Disease Simulations

Anja Wolpers, Johannes Ponge, Adelinde M. Uhrmacher · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

ProCeedRL: Process Critic with Exploratory Demonstration Reinforcement Learning for LLM Agentic Reasoning

Jingyue Gao, Yanjiang Guo, Xiaoshuai Chen, Jianyu Chen · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

How and why does deep ensemble coupled with transfer learning increase performance in bipolar disorder and schizophrenia classification?

Sara Petiton, Antoine Grigis, Benoit Dufumier, Edouard Duchesnay · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

GenGait: A Transformer-Based Model for Human Gait Anomaly Detection and Normative Twin Generation

Elisa Motta, Marta Lorenzini, Clara Mouawad, Alberto Ranavolo, Mariano Serrao, Arash Ajoudani · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management

Andrew Ang, Nazym Azimbayev, Andrey Kim · Apr 2, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 38% · Sparse protocol signal · Freshness: Hot · Status: Fallback

Critique Edit · Coding

Agentic AI shifts the investor's role from analytical execution to oversight.
We present an agentic strategic asset allocation pipeline in which approximately 50 specialized agents produce capital market assumptions, construct portfolios using over 20 competing methods, and critique and vote on each other's output.
