Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 664 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios

Jingen Qu, Lijun Li, Bo Zhang, Yichen Yan, Jing Shao · Sep 4, 2025

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 71% Sparse protocol signal Freshness: Cold Status: Ready
Llm As Judge General
  • Multimodal large language models (MLLMs) are rapidly evolving, presenting increasingly complex safety challenges.
  • Additionally, we introduce a standardized safety dataset evaluation metric: fine-tuning a safety judge model and evaluating its capabilities on other safety datasets.Extensive experiments on various tasks demonstrate the effectiveness of…
Open paper

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 68% Sparse protocol signal Freshness: Cold Status: Ready
General
  • Human and LLM-based evaluations demonstrate that NormGenesis significantly outperforms existing datasets in refinement quality, dialogue naturalness, and generalization performance.
  • Our work establishes a new benchmark for culturally adaptive dialogue modeling and provides a scalable methodology for norm-aware generation across linguistically and culturally diverse languages.
Open paper
Classification errors distort findings in automated speech processing: examples and solutions from child-development research

Lucas Gautheron, Evan Kidd, Anton Malko, Marvin Lavechin, Alejandrina Cristia · Aug 21, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 56% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics Math
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Predicting LLM Reasoning Performance with Small Proxy Model

Woosung Koh, Juyoung Suk, Sungjun Han, Se-Young Yun, Jamin Shin · Sep 25, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 52% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics General
  • In our experiments, rBridge (i) reduces dataset ranking costs by over 100x relative to the best baseline, (ii) achieves the strongest correlation across six reasoning benchmarks at 1B to 32B scale, and (iii) zero-shot transfers predictive…
Open paper
From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

Tianqiao Liu, Xueyi Li, Hao Wang, Haoxuan Li, Zhichao Chen, Weiqi Luo · Sep 24, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 52% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics Coding
  • Comprehensive experiments on Audio-QA, ASR, AAC and speech-to-speech benchmarks show that TtT consistently surpasses strong AR and NAR baselines, with additional ablation and training-strategy analyses confirming the contribution of each…
Open paper
On the Theoretical Limitations of Embedding-Based Retrieval

Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee · Aug 28, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 52% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics Coding
  • These new benchmarks push embeddings to work for any query and any notion of relevance that could be given.
Open paper
ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization

Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin · Sep 16, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 49% Sparse protocol signal Freshness: Cold Status: Ready
Tool Use General
  • Large Language Model (LLM)-based web agents excel at knowledge-intensive tasks but face a fundamental conflict between the need for extensive exploration and the constraints of limited context windows.
  • Current solutions typically rely on architectural modifications, e.g., internal memory tokens, which break compatibility with pre-existing agents and necessitate costly end-to-end retraining.
Open paper

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 56% Moderate protocol signal Freshness: Cold Status: Fallback
Llm As JudgeAutomatic Metrics Multilingual
  • Our benchmark is built on a user manual for an agricultural machine, available in English, French, and German.
  • The evaluation focuses on realistic "needle-in-a-haystack" challenges and includes unanswerable questions to test for hallucinations.
Open paper
Linguistic trajectories of bipolar disorder on social media

Laurin Plank, Armin Zlomuzica · Sep 12, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 49% Sparse protocol signal Freshness: Cold Status: Ready
Medicine
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Language and Experience: A Computational Model of Social Learning in Complex Tasks

Cédric Colas, Tracey Mills, Ben Prystawski, Michael Henry Tessler, Noah Goodman, Jacob Andreas · Aug 26, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 49% Sparse protocol signal Freshness: Cold Status: Ready
Simulation Env General
  • The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments.
  • Using behavioral experiments and simulations across 10 video games, we show how linguistic guidance can shape exploration and accelerate learning by reducing risky interactions and speeding up key discoveries in both humans and models.
Open paper
UPDESH: Synthesizing Grounded Instruction Tuning Data for 13 Indic Languages

Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay, Deepthi Sudharsan · Sep 25, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 46% Sparse protocol signal Freshness: Cold Status: Ready
Multilingual
  • Comprehensive evaluation using automated metrics and 10K human assessments confirms high data quality.
  • Finally, through ablation studies and cultural evaluations, we show that context-aware, culturally grounded data generation is essential for effective multilingual AI development.
Open paper
ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation

Jiho Kim, Junseong Choi, Woosog Chay, Daeun Kyung, Yeonsu Kwon, Yohan Jo · Sep 26, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 33% Moderate protocol signal Freshness: Cold Status: Ready
Pairwise Preference Simulation Env General
  • In our simulation environment, a user agent with a rich persona interacts with the assistant, providing ratings on how well each suggestion aligns with its preferences and context.
  • Built on ProPerSim, we propose ProPerAssistant, a retrieval-augmented, preference-aligned assistant that continually learns and adapts through user feedback.
Open paper
RAPTOR: A Foundation Policy for Quadrotor Control

Jonas Eschmann, Dario Albani, Giuseppe Loianno · Sep 15, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 33% Moderate protocol signal Freshness: Cold Status: Ready
Demonstrations Simulation Env Long Horizon General
  • Humans are remarkably data-efficient when adapting to new unseen conditions, like driving a new car.
Open paper
Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics General
  • Converging evidence suggests that human systems of semantic categories achieve near-optimal compression via the Information Bottleneck (IB) complexity-accuracy tradeoff.
  • Large language models (LLMs) are not trained for this objective, which raises the question: are LLMs capable of evolving efficient human-aligned semantic systems?
Open paper
CLAUSE: Agentic Neuro-Symbolic Knowledge Graph Reasoning via Dynamic Learnable Context Engineering

Yang Zhao, Chengxiao Dai, Wei Zhuo, Yue Xiu, Dusit Niyato · Sep 25, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 33% High protocol signal Freshness: Cold Status: Fallback
Automatic Metrics Multi Agent General
  • We introduce CLAUSE, an agentic three-agent neuro-symbolic framework that treats context construction as a sequential decision process over knowledge graphs, deciding what to expand, which paths to follow or backtrack, what evidence to…
  • CLAUSE employs the proposed Lagrangian-Constrained Multi-Agent Proximal Policy Optimization (LC-MAPPO) algorithm to coordinate three agents: Subgraph Architect, Path Navigator, and Context Curator, so that subgraph construction,…
Open paper
AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation

Tiancheng Huang, Ruisheng Cao, Yuxin Zhang, Zhangyi Kang, Zijian Wang, Chenrun Wang · Sep 21, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 26% Sparse protocol signal Freshness: Cold Status: Ready
Long Horizon General
  • While large language models (LLMs) based agents are capable of automating question answering (QA) workflows for scientific papers, there still lacks a comprehensive and realistic benchmark to evaluate their capabilities.
  • In this work, we propose AirQA, a human-annotated comprehensive paper QA dataset in the field of artificial intelligence (AI), with 13,956 papers and 1,246 questions, that encompasses multi-task, multi-modal and instance-level evaluation.
Open paper
HebID: Detecting Social Identities in Hebrew-language Political Text

Guy Mor-Lan, Naama Rivlin-Angert, Yael R. Kaplan, Tamir Sheafer, Shaul R. Shenhav · Aug 21, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 23% Sparse protocol signal Freshness: Cold Status: Ready
General
  • We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results (macro-F_1 = 0.74).
Open paper
Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 23% Sparse protocol signal Freshness: Cold Status: Ready
General
  • However, the field is fragmented due to inconsistent terminology, isolated evaluation practices, and a lack of unified benchmarks.
  • To bridge this gap, we systematically analyze 134 papers, introduce a unified taxonomy of evidence-based text generation with LLMs, and investigate 300 evaluation metrics across seven key dimensions.
Open paper
UbiQTree: Uncertainty Quantification in XAI with Tree Ensembles

Akshat Dubey, Aleksandar Anžel, Bahar İlgen, Georges Hattab · Aug 13, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 23% Sparse protocol signal Freshness: Cold Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.