Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 110

MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval

Chunyu Li, Jingyi Kang, Ding Chen, Mengyuan Zhang, Jiajun Shen, Bo Tang · May 7, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics General
  • In agent memory systems, the reranking model serves as the critical bridge connecting user queries with long-term memory.
  • On the memory retrieval benchmark, MemReranker-0.6B substantially outperforms BGE-Reranker and matches open-source 4B/8B models as well as GPT-4o-mini on key metrics.
Open paper
Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics

Dominik Dahlem, Diego Maniloff, Mac Misiura · May 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • The resulting two-axis diagnostic (φ for capacity, G for direction) yields a falsifiable polarity prediction: bottleneck- and diffuse-dominated benchmarks should exhibit opposite polarity.
  • Under length-controlled evaluation, transport features retain interpretable signal (LC-AUROC from 0.62 to 0.84) on tested models up to 8B parameters, with polarity reversing as predicted between HaluEval and MedHallu.
Open paper
CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation

Zhipeng Song, Yizhi Zhou, Xiangyu Kong, Jiulong Jiao, Xuezhou Ye, Chunqi Gao · May 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

Yilun Zhao, Jinbiao Wei, Tingyu Song, Siyue Zhang, Chen Zhao, Arman Cohan · May 5, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • This capability is increasingly important for agentic search systems, where retrievers must provide complementary evidence across iterative search and synthesis.
  • We introduce BRIGHT-Pro, an expert-annotated benchmark that expands each query with multi-aspect gold evidence and evaluates retrievers under both static and agentic search protocols.
Open paper
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang, Junhao Su · May 5, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% High protocol signal Freshness: Hot Status: Ready
Rubric Rating Automatic Metrics Long Horizon Math Law
  • Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only…
Open paper
AgentSearchBench: A Benchmark for AI Agent Search in the Wild

Bin Wu, Arastun Mammadli, Xiaoyu Zhang, Emine Yilmaz · Apr 24, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Coding
  • The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task.
  • We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers.
Open paper
Reproducing Complex Set-Compositional Information Retrieval

Vincent Degenhart, Dewi Timman, Arjen P. de Vries, Faegheh Hasibi, Mohanna Hoveyda · May 5, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 72% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Coding
  • We conduct a reproducibility study to benchmark major retrieval families and reasoning-targeted methods on QUEST and QUEST+Variants, and introduce LIMIT+, a controlled benchmark where relevance depends on arbitrary attribute predicates and…
  • We release code and LIMIT+ data generation scripts to support future reproducibility and controlled evaluation.
Open paper
How Hard is it to Decide if a Fact is Relevant to a Query?

Meghyn Bienvenu, Diego Figueira, Pierre Lafourcade · Apr 24, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 72% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Relevance has already been shown to be harder than query evaluation: namely, Σ^p_2-complete for CQs, even over a binary signature.
  • Indeed, we prove that if we forbid or bound the occurrence of self-joins, then relevance has the same complexity as query evaluation, namely, NP (without structural restrictions) and LogCFL (for bounded hypertreewidth classes).
Open paper
Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek, Shiran Liu · Apr 23, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 72% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task.
  • On the HATS dataset, the best LLMs achieve 92–94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics.
Open paper
Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 72% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 72% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • PIE connects Pruning, automatic Interpretation, and interpretation Evaluation, establishing a comprehensive benchmarking environment to systematically measure behavioral fidelity and downstream interpretability under pruning.
  • Because downstream evaluation costs scale linearly per feature, Synergy effectively grants the pipeline 25 "free" features, achieving K=75 fidelity while reducing interpretation costs by 33%.
Open paper
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang · Apr 8, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Rubric Rating Human Eval Automatic Metrics General
  • Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
  • To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences.
Open paper
Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

Marcel Gröpl, Jaewoo Jung, Seungryong Kim, Marc Pollefeys, Sunghwan Hong · Apr 9, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 72% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains on detail-critical and high-resolution settings, while also producing more interpretable…
Open paper
DTCRS: Dynamic Tree Construction for Recursive Summarization

Guanran Luo, Zhongquan Jian, Wentao Qiu, Meihong Wang, Qingqiang Wu · Apr 8, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 72% Moderate protocol signal Freshness: Warm Status: Ready
Long Horizon General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
ChunQiuTR: Time-Keyed Temporal Retrieval in Classical Chinese Annals

Yihao Wang, Zijian He, Jie Ren, Keze Wang · Apr 8, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 72% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Coding
  • We introduce ChunQiuTR, a time-keyed retrieval benchmark built from the Spring and Autumn Annals and its exegetical tradition.
  • Experiments show consistent gains over strong semantic dual-encoder baselines under time-keyed evaluation, supporting retrieval-time temporal consistency as a key prerequisite for faithful downstream historical RAG.
Open paper
