Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 18 · Search mode: keyword

Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu · May 21, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • We introduce Ratchet, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills.
  • On MBPP+ hard-100 with Claude Opus 4.7, Ratchet lifts held-out pass@1 from a 0.258 ± 0.047 baseline to a late-window rolling mean of 0.584 (peak 0.658 ± 0.042) across 100 rounds and 3 seeds, a +0.328 ± 0.018 rolling-mean gain where…
Open paper
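
The Ratchet summary above reports a "late-window rolling mean" and a gain over a baseline. As a minimal sketch of how such figures are typically computed, the Python below averages per-round pass@1 over a trailing window and subtracts a baseline. The window size and the function names (rolling_mean, late_window_gain) are assumptions for illustration; the excerpt does not state the paper's window or aggregation across seeds.

```python
from typing import List


def rolling_mean(values: List[float], window: int) -> List[float]:
    """Trailing rolling mean: one output per full window of `window` values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        # Slide the window one step: add the new value, drop the oldest.
        running += values[i] - values[i - window]
        out.append(running / window)
    return out


def late_window_gain(per_round: List[float], window: int, baseline: float) -> float:
    """Rolling mean over the final `window` rounds minus the baseline score.

    `window` is a hypothetical parameter, not a value taken from the paper.
    """
    return rolling_mean(per_round, window)[-1] - baseline
```

With several seeds, one plausible approach is to compute the gain per seed and then report the mean and spread across seeds, which may differ slightly from subtracting the two headline means.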
Orchard: An Open-Source Agentic Modeling Framework

Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang · May 14, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 90% High protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Tool Use Law Coding
  • Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments.
  • We present Orchard, an open-source framework for scalable agentic modeling.
Open paper
How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson · Apr 24, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Coding
  • In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks.
  • We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently…
Open paper
KLong: Training LLM Agent for Extremely Long-horizon Tasks

Yue Liu, Yingwei Ma, Yibo Miao, Yanhao Li, Yuchong Xie, Xinlong Yang · Feb 19, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Rubric Rating Long Horizon Coding
  • Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics.
  • Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.
Open paper
Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents

Nicholas Edwards, Sebastian Schuster · Mar 27, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Multi Agent Coding
  • We propose an uncertainty-aware multi-agent scaffold that explicitly decouples underspecification detection from code execution.
  • Our results demonstrate that this multi-agent system using OpenHands + Claude Sonnet 4.5 achieves a 69.40% task resolve rate, significantly outperforming a standard single-agent setup (61.20%) and closing the performance gap with agents…
Open paper
CodeScout: An Effective Recipe for Reinforcement Learning of Code Search Agents

Lintang Sutawika, Aditya Bharat Soni, Bharath Sriraam R R, Apurva Gandhi, Taha Yassine, Sanidhya Vijayvargiya · Mar 18, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Simulation Env Coding
  • A prerequisite for coding agents to perform tasks on large repositories is code localization - the identification of relevant files, classes, and functions to work on.
  • In this paper, we demonstrate that, with an effective reinforcement learning recipe, a coding agent equipped with nothing more than a standard Unix terminal can be trained to achieve strong results.
Open paper
Guardrails Beat Guidance: A Large-Scale Study of Rules, Skills, and Persistent Configuration for Coding Agents

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu · Apr 13, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Ready
Coding
  • Random rules improve a coding agent's task performance as much as expert-curated ones (both +13.8pp on a discriminative subset of SWE-bench Verified), and in our data every individually beneficial rule is a negative constraint ("do not…
  • We arrive at these findings through the first large-scale controlled study of agent rule files (CLAUDE.md, .cursorrules, and the broader family of agent skills, plugin manifests, and persona definitions): we scrape 679 rule files (25,532…
Open paper
From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents

Nikolai Ludwig, Wasi Uddin Ahmad, Somshubra Majumdar, Boris Ginsburg · Apr 2, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Ready
Coding Multilingual
  • Our empirical results set a new benchmark for open-source models of comparable size.
  • Furthermore, despite being trained exclusively on Python, our agents demonstrate robust zero-shot transferability on SWE-bench Multilingual, reaching 44.1% and confirming the paradigm's generalizability across diverse languages.
Open paper
KAT-Coder-V2 Technical Report

Fengxiang Li, Han Zhang, Haoyang Huang, Jinghui Wang, Jinhua Hao, Kun Yuan · Mar 29, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Ready
Coding
  • We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou.
  • KAT-Coder-V2 adopts a "Specialize-then-Unify" paradigm that decomposes agentic coding into five expert domains - SWE, WebCoding, Terminal, WebSearch, and General - each undergoing independent supervised fine-tuning and reinforcement…
Open paper

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Ready
Coding
  • AI coding agents can resolve real-world software issues, yet they frequently introduce regressions -- breaking tests that previously passed.
  • When deployed as an agent skill with a different model and framework, TDAD improved issue-resolution rate from 24% to 32%, confirming that surfacing contextual information outperforms prescribing procedural workflows.
Open paper
Hybrid-Gym: Training Coding Agents to Generalize Across Tasks

Yiqing Xie, Emmy Liu, Gaokai Zhang, Nachiket Kotalwar, Shubham Gandhi, Sathwik Acharya · Feb 18, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Ready
Coding
  • When assessing the quality of coding agents, predominant benchmarks focus on solving single issues on GitHub, such as SWE-Bench.
  • Experiments show that agents trained on our synthetic tasks effectively generalize to diverse real-world tasks that are not present in training, improving a base model by 25.4% absolute gain on SWE-Bench Verified, 7.9% on SWT-Bench…
Open paper

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Multi Agent Law Coding
  • LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods--paraphrase consistency, n-gram overlap, perplexity analysis--never directly…
  • We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a…
Open paper
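
The CCV summary above describes solving the same problem in N independent sessions and measuring solution diversity, but the excerpt is truncated and does not define the diversity measure. The sketch below is therefore an assumption, not the paper's metric: it scores diversity as one minus the mean pairwise text similarity from Python's difflib. The function name solution_diversity is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean
from typing import List


def solution_diversity(solutions: List[str]) -> float:
    """Return 1 - mean pairwise similarity over all pairs of solutions.

    0.0 means every session produced identical text; values near 1.0 mean
    the sessions produced largely unrelated text. This is a stand-in
    measure, assumed for illustration only.
    """
    if len(solutions) < 2:
        raise ValueError("need at least two solutions to compare")
    similarities = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(solutions, 2)
    ]
    return 1.0 - mean(similarities)
```

Under this stand-in measure, near-zero diversity across independent sessions would be one possible signal that a solution is being recalled rather than derived, which is the kind of leakage the summary says the method targets.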
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents

Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt, Zijian Wang · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon Coding
  • Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
Open paper
On Randomness in Agentic Evals

Bjarni Haukur Bjarnason, André Silva, Martin Monperrus · Feb 6, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 78% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics General
  • Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks.
  • To enable reliable evaluation of agentic systems, we recommend three concrete practices: (1) estimate pass@1 from multiple independent runs per task, especially when measuring small improvements, (2) use statistical power analysis to…
Open paper
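
Practice (1) in the summary above, estimating pass@1 from multiple independent runs per task, can be sketched as follows. The data layout (a dict from task id to a list of per-run booleans) and the function names are assumptions for illustration, not the paper's code.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def pass_at_1(runs_by_task: Dict[str, List[bool]]) -> Tuple[float, Dict[str, float]]:
    """Average each task's success rate over its runs, then average over tasks."""
    per_task = {
        task: mean(1.0 if ok else 0.0 for ok in runs)
        for task, runs in runs_by_task.items()
    }
    return mean(per_task.values()), per_task


def per_run_scores(runs_by_task: Dict[str, List[bool]]) -> Tuple[List[float], float]:
    """Benchmark score of each run index, plus the standard deviation across runs.

    Requires the same number of runs for every task. The spread shows how far
    a single-run score can drift from the multi-run estimate.
    """
    run_counts = {len(runs) for runs in runs_by_task.values()}
    if len(run_counts) != 1:
        raise ValueError("every task needs the same number of runs")
    k = run_counts.pop()
    scores = [
        mean(1.0 if runs[i] else 0.0 for runs in runs_by_task.values())
        for i in range(k)
    ]
    return scores, (stdev(scores) if k > 1 else 0.0)
```

Comparing two systems on single-run scores alone can mistake this run-to-run spread for a real improvement, which is why the multi-run estimate matters most when the measured difference is small.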
SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents

Yuhang Wang, Yuling Shi, Mo Yang, Rongrui Zhang, Shilin He, Heng Lian · Jan 23, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 78% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics Coding
  • In this paper, we propose SWE-Pruner, a self-adaptive context pruning framework tailored for coding agents.
  • Evaluations across four benchmarks and multiple models validate SWE-Pruner's effectiveness in various scenarios, achieving 23-54% token reduction on agent tasks like SWE-Bench Verified while even improving success rates, and up to 14.84x…
Open paper
SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training

Huatong Song, Lisheng Huang, Shuang Sun, Jinhao Jiang, Ran Le, Daixuan Cheng · Feb 3, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready
Long Horizon Coding
  • In this technical report, we present SWE-Master, an open-source and fully reproducible post-training framework for building effective software engineering agents.
  • We evaluate SWE-Master on SWE-bench Verified, a standard benchmark for realistic software engineering tasks.
Open paper
Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem

Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He · Dec 31, 2025

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready
Long Horizon General
  • We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agentic models.
  • Empirically, we evaluate ROME within a structured setting and introduce Terminal Bench Pro, a benchmark with improved scale and contamination control.
Open paper
Structurally Aligned Subtask-Level Memory for Software Engineering Agents

Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields.

Score: 68% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon Coding
  • Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
  • Recent work has further explored augmenting these agents with memory mechanisms to support long-horizon reasoning.
Open paper
