Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 742 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

WISE: Web Information Satire and Fakeness Evaluation

Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury · Dec 30, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • This study develops WISE (Web Information Satire and Fakeness Evaluation) framework which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as…
  • Our evaluation reveals that MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models, while RoBERTa-base achieves the highest ROC-AUC (95.42%) and strong accuracy (87.36%).
Open paper
Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages

Anaelia Ovalle, Candace Ross, Sebastian Ruder, Adina Williams, Karen Ullrich, Mark Ibrahim · Dec 27, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Multilingual
  • We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages.
  • We develop an error taxonomy through human annotation to characterize these failures, finding they stem primarily from evidential errors (unsupported claims, ambiguous facts) followed by illogical reasoning steps.
Open paper
In-Context Algebra

Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau · Dec 18, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Towards Efficient Agents: A Co-Design of Inference Architecture and System

Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu, Hanting Chen · Dec 20, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon General
  • The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making.
  • Experiments on the BrowseComp-zh and DeepDiver benchmarks demonstrate that through the synergistic collaboration of these methods, AgentInfer reduces ineffective token consumption by over 50%, achieving an overall 1.8-2.5 times speedup with…
Open paper
Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills

Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu · Dec 18, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 78% High protocol signal Freshness: Cold Status: Ready
Pairwise Preference Automatic Metrics Tool Use Law
  • Large language model (LLM) agents are moving beyond prompting alone.
  • ChatGPT marked the rise of general-purpose LLM assistants, DeepSeek showed that on-policy reinforcement learning with verifiable rewards can improve reasoning and tool use, and OpenClaw highlights a newer direction in which agents…
Open paper
DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation

Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull, Daniel R. Cahn · Dec 23, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic MetricsSimulation Env General
  • Realistic user simulation is crucial for training and evaluating multi-turn dialogue systems, yet creating simulators that accurately replicate human behavior remains a significant challenge.
  • An effective simulator must expose the failure modes of the systems under evaluation.
Open paper
LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence

Wenjin Liu, Haoran Luo, Xin Feng, Xiang Ji, Lijuan Zhou, Rui Mao · Dec 4, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics Law
  • However, existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs), hindering the development of legal GI.
  • To address this, we propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs.
Open paper
The Moralization Corpus: Frame-Based Annotation and Analysis of Moralizing Speech Acts across Diverse Text Genres

Maria Becker, Mirko Sommer, Lars Tapken, Yi Wan Teh, Bruno Brocai · Dec 17, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 71% Sparse protocol signal Freshness: Cold Status: Ready
Coding
  • Moralizations are pragmatically complex and often implicit, posing significant challenges for both human annotators and NLP systems.
  • We further evaluate several large language models (LLMs) under varied prompting conditions for the task of moralization detection and moralization component extraction and compare it to human annotations in order to investigate the…
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 68% Sparse protocol signal Freshness: Cold Status: Ready
General
  • LLM-based agents are increasingly deployed for expert decision support, yet human-AI teams in high-stakes settings do not yet reliably outperform the best individual.
  • We propose Collaborative Causal Sensemaking (CCS) as a research agenda to develop this capability from the ground up, spanning new training environments that reward collaborative thinking, representations for shared human-AI mental models,…
Open paper
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

Daeun Lee, Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Dac Lai, Seunghyun Yoon · Dec 1, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 68% Sparse protocol signal Freshness: Cold Status: Ready
Coding
  • While prior streaming benchmarks evaluate temporal reasoning, none measure whether Multimodal Large Language Models (MLLMs) can interpret or leverage human gaze signals within a streaming setting.
  • To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs utilize gaze for temporal and proactive reasoning in streaming videos.
Open paper
Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

David Samuel, Lilja Øvrelid, Erik Velldal, Andrey Kutuzov · Dec 9, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 71% Sparse protocol signal Freshness: Cold Status: Fallback
Pairwise Preference Multilingual
  • Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese.
  • To address this, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language.
Open paper
Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem

Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He · Dec 31, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready
Long Horizon General
  • We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agentic model.
  • Empirically, we evaluate ROME within a structured setting and introduce Terminal Bench Pro, a benchmark with improved scale and contamination control.
Open paper
DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation

Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song · Dec 19, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready
Rubric RatingExpert Verification Long Horizon General
  • However, evaluating such reports remains challenging: report quality is multifaceted, making it difficult to determine what to assess and by what criteria; LLM-based judges may miss errors that require domain expertise to identify; and…
  • To address these issues, we propose DEER, a benchmark for evaluating expert-level deep research reports.
Open paper
A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media

Mengfan Shen, Kangqi Song, Xindi Wang, Wei Jia, Tao Wang, Ziqiang Han · Dec 18, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
OnCoCo 1.0: A Public Dataset for Fine-Grained Message Classification in Online Counseling Conversations

Jens Albrecht, Robert Lehmann, Aleksandra Poltermann, Eric Rudolph, Philipp Steigerwald, Mara Stieler · Dec 10, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 23% Sparse protocol signal Freshness: Cold Status: Ready
Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Multilingual Medical Reasoning for Question Answering with Large Language Models

Pietro Ferrazzi, Aitor Soroa, Rodrigo Agerri · Dec 5, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 23% Sparse protocol signal Freshness: Cold Status: Ready
MedicineMultilingual
  • We test our pipeline in both in-domain and out-of-domain settings across Medical QA benchmarks, and demonstrate that our reasoning traces improve performance both when utilized via in-context learning (few-shot) and supervised fine-tuning,…
Open paper
Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 23% Sparse protocol signal Freshness: Cold Status: Ready
Multilingual
  • Ensuring content safety in large language models (LLMs) is essential for their deployment in real-world applications.
  • To address this, we introduce CREST (CRoss-lingual Efficient Safety Transfer), a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.