Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 8 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

APEX-Agents

Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein · Jan 20, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Ready
Rubric RatingExpert Verification Automatic Metrics Long Horizon Law
  • We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate…
  • We test eight agents for the leaderboard using Pass@1.
Open paper
Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards

Youliang Yuan, Qiuyang Mang, Jingbang Chen, Hong Wan, Xiaoyuan Liu, Junjielong Xu · Oct 9, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 98% High protocol signal Freshness: Cold Status: Ready
Rubric Rating Automatic Metrics Long Horizon MathLaw
  • Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps-abrupt jumps to a correct output without a valid preceding derivation.
  • When integrated into an RL pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks.
Open paper
Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification

Masnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan, Syed Rifat Raiyan, Md Kamrul Hasan, Hasan Mahmud · Mar 30, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Multi Agent LawCoding
  • We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation.
  • In zero-shot evaluations on the Check-COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points, with P-RAG driving the primary performance gains (+7.5 pp).
Open paper
Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% Moderate protocol signal Freshness: Warm Status: Fallback
Human EvalAutomatic Metrics Law
  • Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
  • Human evaluation of the generated explanations across Clarity, Linking, and Usefulness metrics highlights GPT-4o mini's superior interpretability.
Open paper
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu · Apr 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 62% Moderate protocol signal Freshness: Hot Status: Fallback
Simulation Env Long Horizon Law
  • Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities.
  • Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific…
Open paper

Match reason: Matched by broad semantic/index fallback.

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics MathLaw
  • We further demonstrate that constructing DPO preference pairs from NSRSA verification teaches the model to distinguish sound from flawed reasoning (reward accuracy 46% to 63%).
Open paper
CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts

Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik · Jan 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback
Llm As Judge Multi Agent LawCoding
  • To address this, we present CircuitLM, a multi-agent pipeline that translates user prompts into structured, visually interpretable CircuitJSON schematics.
  • To systematically assess performance, we deploy a rigorous dual-layered evaluation methodology: a deterministic Electrical Rule Checking (ERC) engine categorizes topological faults by strict severity (Critical, Major, Minor, Warning), while…
Open paper
ExpGuard: LLM Content Moderation in Specialized Domains

Minseok Choi, Dongjin Kim, Seungbin Yang, Subin Kim, Youngjun Kwak, Juyoung Oh · Mar 3, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Expert Verification LawMedicine
  • With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies.
  • Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial…
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.