Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 137

GLiGuard: Schema-Conditioned Classification for LLM Safeguard

Urchade Zaratiana, Mary Newhauser, George Hurn-Maloney, Ash Lewis · May 8, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Red Team Automatic Metrics Coding
  • Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions.
  • Across nine established safety benchmarks, GLiGuard achieves F1 scores competitive with 7B–27B decoder-based guards despite being 23–90× smaller, while delivering up to 16× higher throughput and 17× lower latency.
Open paper
StoryAlign: Evaluating and Training Reward Models for Story Generation

Haotian Xia, Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou · May 6, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics Coding
  • Although large language models (LLMs) have significantly advanced text generation, stories generated by LLMs still diverge from human-authored works regarding complex narrative structure and human-aligned preferences.
  • We find existing reward models struggle to select human-preferred stories, with the best model achieving only 66.3% accuracy.
Open paper
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang, Junhao Su · May 5, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 65% High protocol signal Freshness: Hot Status: Ready
Rubric Rating Automatic Metrics Long Horizon Math Law
  • Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only…
Open paper
SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs

Sihang Zhao, Kangrui Yu, Youliang Yuan, Pinjia He, Hongyi Wen · Apr 24, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Red Team Automatic Metrics Coding
  • To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial…
  • Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol.
Open paper
Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?

Anmol Gulati, Hariom Gupta, Elias Lumer, Sahil Sen, Vamse Kumar Subbiah · May 8, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon Coding
  • Long-horizon AI agents execute complex workflows spanning hundreds of sequential actions, yet a single wrong assumption early on can cascade into irreversible errors.
  • We introduce a forced-injection framework that provides ground-truth clarifications at controlled points in the agent's trajectory across four information dimensions (goal, input, constraint, context), three agent benchmarks, and four…
Open paper
InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

Bohan Hou, Jiuning Gu, Jiayan Guo, Ronghao Dang, Sicong Leng, Xin Li · May 8, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 65% High protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Tool Use Coding
  • We introduce InterLV-Search, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search.
  • Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence…
Open paper

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Web Browsing Coding
  • We demonstrate the system's effectiveness through comprehensive evaluation across multiple extraction scenarios in Traditional Chinese Medicine research, achieving structured output compliance rates exceeding 94% and information extraction…
Open paper
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

Zihan Lin, Xiaohan Wang, Jie Cao, Jiajun Chai, Li Wang, Xiaodong Lu · May 1, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Tool Use Math Coding
  • ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong…
Open paper
Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA

Zhanli Li, Yixuan Cao, Lvzhou Luo, Ping Luo · Apr 24, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 65% High protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Multi Agent Coding
  • We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis.
  • To address these limitations, we propose a multi-agent workflow that orchestrates planning, extraction, and code generation modules.
Open paper
QED-Nano: Teaching a Tiny Model to Prove Hard Theorems

LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching, Jia Li · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Rubric Rating Automatic Metrics Math Coding
  • To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.
Open paper
Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers

Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki, Kiyoharu Aizawa · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Rubric Rating Automatic Metrics Coding
  • We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal…
  • For evaluation, we introduce PaperWrite-Bench, a benchmark of 51 papers from top-tier venues across diverse domains published after 2025.
Open paper
Do Phone-Use Agents Respect Your Privacy?

Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Coding
  • We study whether phone-use agents respect privacy while completing benign mobile tasks.
  • To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents.
Open paper
Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Tool Use Coding
  • Across five model configurations, two families, and three benchmarks, we find that 52–88% of chain-of-thought tokens are produced after the answer is recoverable from a partial prefix.
Open paper
Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

Komal Kumar, Aman Chadha, Salman Khan, Fahad Shahbaz Khan, Hisham Cholakkal · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Multi Agent Coding
  • Recent advances in multi-agent large language models (LLMs) have demonstrated strong potential for understanding user intent and are being trained to utilize various tools.
  • In this paper, we introduce Paper Circle, a multi-agent research discovery and analysis system designed to reduce the effort required to find, assess, organize, and understand academic literature.
Open paper
AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning

Yuanfu Sun, Kang Li, Dongzhe Fan, Jiajin Liu, Qiaoyu Tan · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Tool Use Coding
  • To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology-aware navigation and LLM-based inference.
  • Specifically, we propose AgentGL, the first reinforcement learning (RL)-driven framework for AGL.
Open paper
SkillX: Automatically Constructing Skill Knowledge Bases for Agents

Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon Coding
  • Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient: agents learn in isolation, repeatedly rediscover similar behaviors from limited…
  • To address this problem, we propose SkillX, a fully automated framework for constructing a plug-and-play skill knowledge base that can be reused across agents and environments.
Open paper
Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Web Browsing Coding
  • Extensive evaluations across 1.5B–14B parameter models demonstrate that APC reduces expected editing costs from 19% to 50% while preserving standard HC performance.
Open paper

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon Math Coding
  • Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval.
  • Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism.
Open paper
