Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 74 Search mode: keyword Ranking: eval-signal prioritized Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang · Apr 8, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 100% High protocol signal Freshness: Hot Status: Ready
Pairwise PreferenceRubric Rating Human EvalAutomatic Metrics General
  • Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
  • To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences.
Open paper
Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 100% High protocol signal Freshness: Hot Status: Ready
Expert Verification Llm As JudgeAutomatic Metrics Medicine
  • In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
  • PubMed Reasoner with a GPT-4o backbone achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, and showing consistent gains on MMLU Clinical Knowledge.
Open paper
Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu, Xinyu Dai · Apr 2, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 95% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics General
  • However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process.
  • To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality.
Open paper

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 95% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics General
  • Using the Anthropic HHRLHF dataset, we evaluate ten diverse large language models LLMs under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task.
  • Beyond accuracy, we integrate SHAP and LIME to provide fine-grained interpretability, revealing that model decisions depend on contextualized safety and supportive framing rather than isolated keywords.
Open paper
FGR-ColBERT: Identifying Fine-Grained Relevance Tokens During Retrieval

Antonín Jarolím, Martin Fajčík · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 95% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
MemRerank: Preference Memory for Personalized Product Reranking

Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yu Gong · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 95% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics General
  • LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch.
  • We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking.
Open paper
Analysing Calls to Order in German Parliamentary Debates

Nina Smirnova, Daniel Dan, Philipp Mayr · Mar 27, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 95% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Law
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework

Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang, Yue Lv · Mar 25, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 95% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics General
  • However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement.
  • It contains three key innovations: (1) a thought-augmented complex query understanding module, which enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference; (2) a reasoning-internalized…
Open paper
DomAgent: Leveraging Knowledge Graphs and Case-Based Reasoning for Domain-Specific Code Generation

Shuai Wang, Dhasarathy Parthasarathy, Robert Feldt, Yinan Yu · Mar 22, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 95% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Coding
  • To address this challenge, we propose DomAgent, an autonomous coding agent that bridges this gap by enabling LLMs to generate domain-adapted code through structured reasoning and targeted retrieval.
  • We evaluate DomAgent on an open benchmark dataset in the data science domain (DS-1000) and further apply it to real-world truck software development tasks.
Open paper
Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval

Hangeol Chang, Changsun Lee, Seungjoon Rho, Junho Yeo, Jong Chul Ye · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 95% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

Marcel Gröpl, Jaewoo Jung, Seungryong Kim, Marc Pollefeys, Sunghwan Hong · Apr 9, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains on detail-critical and high-resolution settings, while also producing more interpretable…
Open paper
ChunQiuTR: Time-Keyed Temporal Retrieval in Classical Chinese Annals

Yihao Wang, Zijian He, Jie Ren, Keze Wang · Apr 8, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Coding
  • We introduce ChunQiuTR, a time-keyed retrieval benchmark built from the Spring and Autumn Annals and its exegetical tradition.
  • Experiments show consistent gains over strong semantic dual-encoder baselines under time-keyed evaluation, supporting retrieval-time temporal consistency as a key prerequisite for faithful downstream historical RAG.
Open paper
JUÁ -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections

Jayr Pereira, Leandro Fernandes, Erick de Brito, Roberto Lotufo, Luiz Bonifacio · Apr 7, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Law
  • We present JUÁ, a public benchmark for Brazilian legal retrieval designed to support more reproducible and comparable evaluation across heterogeneous legal collections.
  • More broadly, JUÁ is intended not only as a benchmark, but as a continuous evaluation infrastructure for Brazilian legal IR, combining shared protocols, common ranking metrics, fixed splits when applicable, and a public leaderboard.
Open paper
VISTA: Visualization of Token Attribution via Efficient Analysis

Syed Ahmed, Bharathi Vokkaliga Ganesh, Jagadish Babu P, Karthick Selvaraj, Praneeth Talluri, Sanket Hingne · Apr 2, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics MathCoding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Screening Is Enough

Ken M. Nakanishi · Apr 1, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • In addition, among these semantic metrics, those trained with human ratings can detect omissions and other subtle semantic issues that embedding-based metrics often miss.
Open paper
How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models

Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó · Mar 26, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 95% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation.
  • Benchmarks such as LOCOMO and LOCCO report performance degradation from 0.455 to 0.05 across stages, while MultiWOZ shows 78.2% accuracy with 6.8% false memory rate under persistent retention.
Open paper

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 88% Moderate protocol signal Freshness: Warm Status: Ready
Rubric Rating Automatic Metrics General
  • Experiments on the multimodal EssayJudge dataset show that DLOM improves over a generation-based SFT baseline across scoring traits, and DLOM-GF yields further gains when modality relevance is heterogeneous.
  • On the text-only ASAP/ASAP++ benchmarks, DLOM remains effective without visual inputs, and DLOM-DA further improves performance and outperforms strong representative baselines.
Open paper
PJB: A Reasoning-Aware Benchmark for Person-Job Retrieval

Guangzhi Wang, Xiaohui Yang, Kai Li, Jiawen He, Kai Yang, Ruixuan Zhang · Mar 18, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • As retrieval models converge on generic benchmarks, the pressing question is no longer "who scores higher" but rather "where do systems fail, and why?" Person-job matching is a domain that urgently demands such diagnostic capability -- it…
  • We introduce PJB (Person-Job Benchmark), a reasoning-aware retrieval evaluation dataset that uses complete job descriptions as queries and complete resumes as documents, defines relevance through job-competency judgment, is grounded in…
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.