Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 10 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang, Junhao Su · May 5, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Rubric Rating).

Score: 65% High protocol signal Freshness: Hot Status: Ready
Rubric Rating Automatic Metrics Long Horizon MathLaw
  • Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only…
Open paper

Match reason: Matches selected tags (Long Horizon, Rubric Rating).

Score: 65% High protocol signal Freshness: Hot Status: Ready
Rubric Rating Automatic Metrics Long Horizon General
  • Multi-agent debate (MAD), and more broadly closed-system reasoning where agents iteratively transform each other's outputs, tends to preserve answer accuracy while degrading the reasoning behind those answers.
  • An R6 cohort study (Korean n=10x30 FEVER; English n=3x200 SciFact) finds inter-rater Fleiss kappa <= +0.018 with 0.8-1.4 Likert intra-rater shifts across language and domain -- the human agreement that faithfulness metrics have been…
Open paper
\$OneMillion-Bench: How Far are Language Agents from Human Experts?

Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen, Kaiyuan Chen · Mar 9, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Rubric Rating).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Rubric Rating Automatic Metrics Tool Use Law
  • To this end, we introduce \OneMillion-Bench \OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios.
  • We adopt a rubric-based evaluation protocol scoring factual accuracy, logical coherence, practical feasibility, and professional compliance, focused on expert-level problems to ensure meaningful differentiation across agents.
Open paper
KLong: Training LLM Agent for Extremely Long-horizon Tasks

Yue Liu, Yingwei Ma, Yibo Miao, Yanhao Li, Yuchong Xie, Xinlong Yang · Feb 19, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Rubric Rating).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Rubric Rating Long Horizon Coding
  • Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics.
  • Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.
Open paper
APEX-Agents

Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein · Jan 20, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Rubric Rating).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Rubric RatingExpert Verification Automatic Metrics Long Horizon Law
  • We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate…
  • We test eight agents for the leaderboard using Pass@1.
Open paper
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

Michael Krumdick, Varshini Reddy, Shivani Chaudhary, William Day, Maarij Ahmed, Hayan Haqqi · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Rubric Rating).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Rubric Rating Long Horizon General
  • To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task to complete.
  • We demonstrate that our human experts both receive higher scores on average, and are more likely to provide client-ready outputs than current state-of-the-art systems.
Open paper
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang · Feb 13, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Rubric Rating).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise PreferenceRubric Rating Long Horizon Medicine
  • MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities.
  • For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision…
Open paper
MIND: Unified Inquiry and Diagnosis RL with Criteria Grounded Clinical Supports for Psychiatric Consultation

Guoyi Li, Shihao Xu, Jiatong Ma, Yunyun Han, Jianhua Chen, Yafeng Deng · Mar 4, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Rubric Rating).

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Rubric Rating Automatic Metrics Long Horizon Medicine
  • Large language models (LLMs) have advanced medical dialogue systems, yet psychiatric consultation poses substantially higher demands due to subjective ambiguity and comorbidity complexity: an agent must continuously extract…
Open paper
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright, Marcus Yearwood · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Long Horizon, Rubric Rating).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Pairwise PreferenceRubric Rating Llm As JudgeSimulation Env Long Horizon General
  • Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
  • We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations.
Open paper
DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation

Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song · Dec 19, 2025

Citations: 0

Match reason: Matches selected tags (Long Horizon, Rubric Rating).

Score: 50% Moderate protocol signal Freshness: Cold Status: Ready
Rubric RatingExpert Verification Long Horizon General
  • However, evaluating such reports remains challenging: report quality is multifaceted, making it difficult to determine what to assess and by what criteria; LLM-based judges may miss errors that require domain expertise to identify; and…
  • To address these issues, we propose DEER, a benchmark for evaluating expert-level deep research reports.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.