Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 501 · Search mode: keyword

Match reason: Ranked by recency.

Score: 45% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics General
  • Our contributions are three-fold: (1) a structured representation framework for situated lexical meaning; (2) COCA-Scenes, a dataset of 520 usage instances across 26 keywords for distinct scene identification; and (3) empirical evidence…
Open paper
Citations: 0

Match reason: Ranked by recency.

Score: 45% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Coding
  • The annotation process combines expert human judgment with model-assisted pre-labeling verified by trained annotators, achieving substantial inter-annotator agreement (Cohen's kappa = 0.85).
Open paper

Match reason: Ranked by recency.

Score: 45% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content.
Open paper
SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready
Simulation Env General
  • Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world…
  • Finally, we show that finetuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of…
Open paper
BeLink: Biomedical Entity Linking Meets Generative Re-Ranking

Darya Shlyk, Stefano Montanelli, Lawrence Hunter · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Medicine
  • Our method demonstrates strong performance on multiple BEL benchmarks, yielding significant improvements in linking accuracy (3%-24%) while reducing inference time compared to the state-of-the-art.
Open paper
Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity

Hangyue Zhao, Paul Caillon, Erwan Fagnou, Alexandre Allauzen · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Recent task-specific attention operators can compress deep Transformer stacks into a few layers by performing multi-hop state propagation within a single layer, but their dense evaluation remains expensive.
  • On controlled tracking benchmarks, our method matches the dense operator's accuracy while reducing wall-clock time by 12-29% under a standardized measurement protocol, and is up to 2.4× faster than a compact dense Transformer at…
Open paper

Match reason: Ranked by recency.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Two NLA-inspired evaluations strengthen this picture: the fifteen selective features explain only 31% of activation variance versus the SAE's 99.7%, and selectivity ratio anticorrelates with causal force (r = -0.56).
  • A cost-based deployment evaluation (assumed 50/FN, 0.42/FP, 2% error rate) finds an optimal monitor configuration yielding 8.96 per 1000 queries against a 1000 baseline, a 99.1% saving.
Open paper
LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

Yuchun Fan, Bei Li, Peiguang Li, Yilin Wang, Yongyu Mu, Jian Yang · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 38% Sparse protocol signal Freshness: Hot Status: Ready
Long Horizon Math Multilingual
  • Empirical results on challenging multilingual mathematical benchmarks reveal that LANG substantially enhances reasoning performance without compromising language consistency.
Open paper
Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Xuxin Zhang, Huangyu Dai · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 38% Sparse protocol signal Freshness: Hot Status: Ready
Long Horizon Coding
  • Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent.
  • We take a step back and ask whether any of this machinery is actually necessary, and propose Search-E1, a self-evolution method that lets a search-augmented agent improve through only vanilla GRPO interleaved with offline self-distillation…
Open paper
Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion

Meimingwei Li, Yuanhao Ding, Esteban Garces Arias, Christian Heumann · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% Sparse protocol signal Freshness: Hot Status: Ready
Coding
  • We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories.
  • We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes.
Open paper
Citations: 0

Match reason: Ranked by recency.

Score: 35% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Single-point evaluation ignores a main problem of the instruction-based approach, namely its sensitivity to the phrasing of the instruction.
  • Our findings suggest that single-prompt evaluation is insufficient for instruction-tuned embedding models and that benchmarks should incorporate prompt robustness, either by evaluating over multiple prompts or by reporting sensitivity…
Open paper
Reflecti-Mate: A Conversational Agent for Adaptive Decision-Making Support Through System 1 and System 2 Thinking

Morita Tarvirdians, Senthil Chandrasegaran, Hayley Hung, Catholijn M. Jonker, Catharine Oertel · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% Sparse protocol signal Freshness: Hot Status: Ready
General
  • In this study, we investigate an agent designed to encourage integration by adapting to the individual user's thought patterns.
  • We explore its effects on participants' perceptions of the agent and their reflective behavior, in comparison with unaided pre-reflection and a baseline agent.
Open paper
Polite on the Surface, Wrong in Practice: A Curated Dataset for Fixing Honorific Failures in Multilingual Bangla Generation

Md. Asaduzzaman Shuvo, Mahedi Hasan, Md. Tashin Parvez, Azizul Haque Noman, Md. Shafayet Hossain Ovi · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% Sparse protocol signal Freshness: Hot Status: Ready
Coding Multilingual
  • To address this limitation, we introduce a novel, culturally aligned instruction-tuning dataset for BangLa Application and DialoguE generation, BLADE, and a benchmarking framework comprising 4,196 meticulously curated interaction pairs.
  • Our empirical evaluations demonstrate that models fine-tuned on our dataset yield substantial improvements in structural fidelity and honorific alignment, providing a rigorous benchmark for bridging pragmatic disparities in low-resource…
Open paper
Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation

Genoveffa Martone, Helena Bonaldi, Marco Guerini · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% Sparse protocol signal Freshness: Hot Status: Ready
General
  • 23 experts revise the generated CS, which are assessed via human and automatic metrics.
  • Based on the post-edited CS, the mixed strategy proves to be the most effective in crowdsourcing evaluation, pairing strong factual correction with stereotype mitigation and empathetic engagement.
Open paper
Unified Data Selection for LLM Reasoning

Xiaoyuan Li, Yubo Ma, Chengpeng Li, Fengbin Zhu, Yiyao Yu, Keqin Bao · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
