Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 19 Search mode: keyword

PRBench: End-to-end Paper Reproduction in Physics Research

Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li · Mar 29, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 100% Moderate protocol signal Freshness: Warm Status: Ready
Rubric Rating Expert Verification Automatic Metrics Simulation Env Coding
  • We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics.
  • Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution.
Open paper
CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

Hao Wang, Licheng Pan, Zhichao Chen, Chunyuan Zheng, Zhixuan Chu, Xiaoxi Li · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Coding
  • Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly…
  • Extensive experiments across diverse LLM backbones and benchmark datasets validate that CausalRM effectively learns accurate reward signals from noisy and biased observational feedback and delivers substantial performance improvements on…
Open paper
StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, Caiwen Ding · Mar 3, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Ready
Rubric Rating Automatic Metrics Multi Agent Coding
  • To address the challenge, in this work, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it…
  • Experiments on KernelBench show that StitchCUDA achieves nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup over the multi-agent baseline and 2.73x than the RL model baselines.
Open paper
Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation

Harry Stuart, Masahiro Kaneko, Timothy Baldwin · Mar 2, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Ready
Rubric Rating Automatic Metrics Coding
  • Effective hiring is integral to the success of an organisation, but it is very challenging to find the most suitable candidates because expert evaluation (e.g., interviews conducted by a technical manager) is expensive to deploy at scale.
Open paper
Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

Jingyi Xu, Xingyu Ren, Zhoupeng Shou, Yumeng Zhang, Zhiqiang You · Jan 24, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Long Horizon Coding
  • To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent.
  • We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data.
Open paper
DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

Zherui Li, Zheng Nie, Zhenhong Zhou, Yue Liu, Yitong Zhang, Yu Cheng · Sep 29, 2025

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 98% Moderate protocol signal Freshness: Cold Status: Ready
Red Team Automatic Metrics Coding
  • Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final…
  • These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential.
Open paper
Activation-Guided Local Editing for Jailbreaking Attacks

Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang, Haohua Du · Aug 1, 2025

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 98% Moderate protocol signal Freshness: Cold Status: Ready
Red Team Automatic Metrics Law Coding
  • Token-level jailbreak attacks often produce incoherent or unreadable inputs and exhibit poor transferability, while prompt-level attacks lack scalability and rely heavily on manual effort and human ingenuity.
Open paper
Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Tool Use Coding
  • Large Language Model (LLM) agents increasingly use external tools for complex tasks and rely on embedding-based retrieval to select a small top-k subset for reasoning.
  • We provide theoretical analysis of retrieval saturation and show on standard benchmarks that ToolFlood achieves up to a 95% attack success rate with a low injection rate (1% in ToolBench).
Open paper
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 100% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon Coding
  • Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
  • Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion.
Open paper
CoAct-1: Computer-using Multi-Agent System with Coding Actions

Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li · Aug 5, 2025

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 98% High protocol signal Freshness: Cold Status: Fallback
Automatic Metrics Long Horizon Coding
  • In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action.
  • We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods.
Open paper
Do Phone-Use Agents Respect Your Privacy?

Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo · Apr 1, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 81% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Coding
  • We study whether phone-use agents respect privacy while completing benign mobile tasks.
  • To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents.
Open paper
SkillX: Automatically Constructing Skill Knowledge Bases for Agents

Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao · Apr 6, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 81% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon Coding
  • Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient: agents learn in isolation, repeatedly rediscover similar behaviors from limited…
  • To address this problem, we propose SkillX, a fully automated framework for constructing a plug-and-play skill knowledge base that can be reused across agents and environments.
Open paper
Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 81% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Tool Use Coding
  • We present Agent-Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world productivity software API tasks via code execution.
  • In addition, we evaluate the robustness of the framework with ablation experiments to assess the contribution of access to API documentation on benchmark performance.
Open paper
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary · Oct 5, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 76% High protocol signal Freshness: Cold Status: Ready
Rubric Rating Automatic Metrics Simulation Env Coding
  • We present a principled Bayesian evaluation framework that replaces Pass@k and average accuracy over N trials (avg@N) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and…
  • Together, these results recommend replacing Pass@k for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit.
Open paper
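To make the posterior-based idea in the summary above concrete, here is a minimal sketch of the simplest binary case. It is not the paper's implementation: it assumes a uniform Beta(1, 1) prior and pass/fail outcomes only, and the function name `posterior_summary` and its parameters are hypothetical. The credible interval is approximated by sampling rather than computed exactly.

```python
import random


def posterior_summary(successes, trials, level=0.95, draws=20000, seed=0):
    """Summarize a Beta posterior over a model's underlying success probability.

    Assumes a uniform Beta(1, 1) prior and binary (pass/fail) outcomes.
    Returns (posterior mean, interval lower bound, interval upper bound).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    # Conjugate update: Beta(1, 1) prior plus observed successes and failures.
    alpha = 1 + successes
    beta = 1 + trials - successes
    mean = alpha / (alpha + beta)

    # Approximate the equal-tailed credible interval by Monte Carlo sampling.
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1 - level) / 2
    lower = samples[int(tail * draws)]
    upper = samples[int((1 - tail) * draws) - 1]
    return mean, lower, upper


if __name__ == "__main__":
    # Hypothetical data: 7 passes out of 10 trials.
    mean, lower, upper = posterior_summary(successes=7, trials=10)
    print(f"posterior mean={mean:.3f}, 95% interval=({lower:.3f}, {upper:.3f})")
```

Unlike a single Pass@k number, the interval widens when few trials are available, which is the sense in which such a protocol makes uncertainty explicit.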
AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song · Jun 17, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 76% High protocol signal Freshness: Cold Status: Fallback
Automatic Metrics Long Horizon Coding
  • We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents.
  • Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18% success at difficulty level 1 to just 4% at level 6, highlighting the benchmark's difficulty and discriminative power.
Open paper
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri, Trizal Garg · Dec 26, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 53% High protocol signal Freshness: Cold Status: Ready
Expert Verification Automatic Metrics Coding Multilingual
  • To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
  • We evaluate six state-of-the-art models, including GPT-4o, Claude 3.7 Sonnet, and open-source models, using a strict evaluation protocol.
Open paper
Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification

Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav, Zsolt Kira · Jul 15, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 53% High protocol signal Freshness: Cold Status: Ready
Pairwise Preference Automatic Metrics Simulation Env Long Horizon Math Coding
  • We evaluate MLLM verifiers across web navigation, computer use, and robotics, spanning 13+ models, 28+ designs, and thousands of trajectories from diverse agents.
  • Our methods yield more human-aligned verifiers, improving failure detection by 25pp and accuracy by 14pp.
Open paper
