Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 397 Search mode: keyword

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

Zijun Wang, Haoqin Tu, Letian Zhang, Hardy Chen, Juncheng Wu, Xiangyan Liu · Apr 6, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Law
  • OpenClaw, the most widely deployed personal AI agent in early 2026, operates with full local system access and integrates with sensitive services such as Gmail, Stripe, and the filesystem.
  • To address this gap, we present the first real-world safety evaluation of OpenClaw and introduce the CIK taxonomy, which unifies an agent's persistent state into three dimensions, i.e., Capability, Identity, and Knowledge, for safety…
Open paper
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Ming Lei, Jie Gao · Apr 9, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and overhead, while hallucinations further limit real-world deployment.
  • Experiments on Mandarin and English benchmarks show that our method achieves competitive performance with state-of-the-art models using only 2.3B parameters, while also effectively mitigating hallucinations through our decoupling-oriented…
Open paper
Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Fine-tuned checkpoints and evaluation scripts are released on HuggingFace.
Open paper

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Multilingual
  • Pashto is spoken by approximately 60–80 million people but has no published benchmarks for multilingual automatic speech recognition (ASR) on any shared public test set.
  • For cross-domain evaluation, five fine-tuned Pashto ASR models are evaluated on both test sets: published WER figures of 14% degrade to 32.5–59% on out-of-distribution sets, while one augmented model achieves 35.1% on both sets with zero…
Open paper
CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech

Youssef Saidi, Haroun Elleuch, Fethi Bougares · Apr 2, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Multilingual
  • We benchmark both pipeline systems (ASR + text NER) and E2E models based on Whisper and AraBEST-RQ.
  • Our dataset and models are publicly released at https://huggingface.co/datasets/Elyadata/CV18-NER, providing the first open benchmark for end-to-end named entity recognition from Arabic speech.
Open paper
Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

Thibault Bañeras-Roux, Sergio Burdisso, Esaú Villatoro-Tello, Dairazalia Sánchez-Cortés, Shiran Liu, Severin Baroudi · Apr 7, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
To Adapt or not to Adapt, Rethinking the Value of Medical Knowledge-Aware Large Language Models

Ane G. Domingo-Aldama, Iker De La Iglesia, Maitane Urruela, Aitziber Atutxa, Ander Barrena · Apr 8, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Sparse protocol signal Freshness: Hot Status: Ready
Medicine
  • BACKGROUND: Recent studies have shown that domain-adapted large language models (LLMs) do not consistently outperform general-purpose counterparts on standard medical benchmarks, raising questions about the need for specialized clinical…
  • We introduce a perturbation-based evaluation benchmark that probes model robustness, instruction following, and sensitivity to adversarial variations.
Open paper
The Model Agreed, But Didn't Learn: Diagnosing Surface Compliance in Large Language Models

Xiaojie Gu, Ziying Huang, Weicong Hong, Jian Xie, Renze Lou, Kai Zhang · Apr 7, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Sparse protocol signal Freshness: Hot Status: Ready
Coding
  • However, while recent editors demonstrate high success rates on standard benchmarks, it remains questionable whether current evaluation frameworks that rely on assessing output under specific prompting conditions can reliably authenticate…
  • This probing reveals a pervasive phenomenon of Surface Compliance, where editors achieve high benchmark scores by merely mimicking target outputs without structurally overwriting internal beliefs.
Open paper
MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang, Chao Feng · Apr 7, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 68% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics General
  • Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.
Open paper
METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues

Haofu Yang, Jiaji Liu, Chen Huang, Faguo Wu, Wenqiang Lei, See-Kiong Ng · Apr 13, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 64% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Coding
  • Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies.
  • Experimental results across two benchmarks show that METRO demonstrates promising performance, outperforming existing methods by an average of 9%-10%.
Open paper
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Wenbo Hu, Xin Chen, Yan Gao-Tian, Yihe Deng, Nanyun Peng, Kai-Wei Chang · Apr 9, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% Sparse protocol signal Freshness: Hot Status: Ready
Long Horizon Math
  • Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
Open paper
AtomEval: Atomic Evaluation of Adversarial Claims in Fact Verification

Hongyi Cen, Mingxin Wang, Yule Liu, Jingyi Zheng, Hanze Jia, Tan Tang · Apr 9, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% Sparse protocol signal Freshness: Hot Status: Ready
General
  • We introduce AtomEval, a validity-aware evaluation framework that decomposes claims into subject-relation-object-modifier (SROM) atoms and scores adversarial rewrites with Atomic Validity Scoring (AVS), enabling detection of factual…
  • Experiments on the FEVER dataset across representative attack strategies and LLM generators show that AtomEval provides more reliable evaluation signals.
Open paper
Pickalo: Leveraging 6D Pose Estimation for Low-Cost Industrial Bin Picking

Alessandro Tarsi, Matteo Mastrogiuseppe, Saverio Taliani, Simone Cortinovis, Ugo Pattacini · Apr 6, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
SkillX: Automatically Constructing Skill Knowledge Bases for Agents

Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao · Apr 6, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 68% High protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon Coding
  • Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient: agents learn in isolation, repeatedly rediscover similar behaviors from limited…
  • To address this problem, we propose SkillX, a fully automated framework for constructing a plug-and-play skill knowledge base that can be reused across agents and environments.
Open paper
AI generates well-liked but templatic empathic responses

Emma Gueorguieva, Hongli Zhan, Jina Suh, Javier Hernandez, Tatiana Lau, Junyi Jessy Li · Apr 9, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 58% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Recent research shows that greater numbers of people are turning to Large Language Models (LLMs) for emotional support, and that people rate LLM responses as more empathic than human-written responses.
  • Across a set of 2 studies comparing a total of n = 3,265 AI-generated (by six models) and n = 1,290 human-written responses, we find that LLM responses are highly formulaic at a discourse functional level.
Open paper
Short Data, Long Context: Distilling Positional Knowledge in Transformers

Patrick Huber, Ernie Chang, Chinnadhurai Sankar, Rylan Conway, Igor Fedorov, Md Rifat Arefin · Apr 7, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 58% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 64% Moderate protocol signal Freshness: Hot Status: Fallback
Simulation Env Multi Agent Law
  • We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation.
  • Agents are instantiated using nine interpretable traits organized into four archetypes, enabling systematic control over rhetorical style and strategic orientation.
Open paper
Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning

Bangji Yang, Hongbo Ma, Jiajun Fan, Ge Liu · Apr 2, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Math Law
  • Across both 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% while consistently maintaining or improving accuracy across five major mathematical benchmarks.
Open paper

Match reason: Matched by broad semantic/index fallback.

Score: 45% High protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon Math
  • We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement.
  • We evaluate TrACE against greedy decoding and fixed-budget self-consistency (SC-4, SC-8) on two benchmarks spanning single-step reasoning (GSM8K, n=50) and multi-step household navigation (MiniHouse, n=30), using a Qwen 2.5 3B Instruct…
Open paper
When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning

Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin, Adriano Veloso · Apr 2, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Sparse protocol signal Freshness: Hot Status: Ready
Web Browsing General
  • Reinforcement learning (RL) agents often struggle with out-of-distribution (OOD) scenarios, leading to high uncertainty and random behavior.
  • We introduce Adaptive Safety through Knowledge (ASK), which combines smaller LMs with trained RL policies to enhance OOD generalization without retraining.
Open paper
