
Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 46



HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li, Xander Xu · Feb 15, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 58% · High protocol signal · Freshness: Warm · Status: Ready
Expert Verification · Critique Edit · Automatic Metrics · Law
  • Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
  • However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons.
Open paper
APEX-Agents

Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein · Jan 20, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 58% · High protocol signal · Freshness: Warm · Status: Ready
Rubric Rating · Expert Verification · Automatic Metrics · Long Horizon · Law
  • We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate…
  • We test eight agents for the leaderboard using Pass@1.
Open paper
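The APEX-Agents entry above reports leaderboard results using Pass@1, the single-attempt success rate. As a minimal sketch (function and task names are illustrative, not from the paper), Pass@1 with one recorded attempt per task reduces to the fraction of tasks solved on the first try:

```python
def pass_at_1(results):
    """Fraction of tasks whose single recorded attempt succeeded.

    `results` maps a task id to a bool (True if the agent's one
    attempt passed). With exactly one attempt per task, Pass@1
    is simply the overall success rate.
    """
    if not results:
        raise ValueError("no results to score")
    return sum(results.values()) / len(results)

# Example: 8 tasks, 5 solved on the first attempt -> 0.625
scores = {f"task_{i}": ok for i, ok in enumerate(
    [True, True, False, True, False, True, True, False])}
```

For k > 1 sampled attempts per task, leaderboards typically use the unbiased pass@k estimator instead of naive averaging; the k = 1 case shown here is the degenerate form.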
The Trinity of Consistency as a Defining Principle for General World Models

Jingxuan Wei, Siyuan Li, Yuhang Xu, Zheng Sun, Junjie Jiang, Hexuan Jin · Feb 26, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Simulation Env · Long Horizon · Law
  • To complement this conceptual framework, we introduce CoW-Bench, a benchmark centered on multi-frame reasoning and generation scenarios.
  • CoW-Bench evaluates both video generation models and UMMs under a unified evaluation protocol.
Open paper
Citations: 0

Match reason: Matches selected tags (Law).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Automatic Metrics · Multi Agent · Law
  • We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
  • The pipeline intercepts Whisper's initial transcript, applies specialized LLM agents for domain context identification, named entity recognition, and jargon detection, and generates compact prompts that guide Whisper's decoder.
Open paper
Multimodal Multi-Agent Empowered Legal Judgment Prediction

Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu, Rong Fu · Jan 19, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Simulation Env · Multi Agent · Law
  • Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.
  • Experiments on JurisMM and the benchmark LawBench validate our framework's effectiveness.
Open paper
Citations: 0

Match reason: Matches selected tags (Law).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Human Eval · Automatic Metrics · Law
  • Vichara surpasses existing judgment prediction baselines on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
  • Human evaluation of the generated explanations across Clarity, Linking, and Usefulness metrics highlights GPT-4o mini's superior interpretability.
Open paper
CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts

Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik · Jan 8, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Fallback
LLM As Judge · Multi Agent · Law · Coding
  • To address this, we present CircuitLM, a multi-agent pipeline that translates user prompts into structured, visually interpretable CircuitJSON schematics.
  • To systematically assess performance, we deploy a rigorous dual-layered evaluation methodology: a deterministic Electrical Rule Checking (ERC) engine categorizes topological faults by strict severity (Critical, Major, Minor, Warning), while…
Open paper
Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills

Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu · Dec 18, 2025

Citations: 0

Match reason: Matches selected tags (Law).

Score: 53% · High protocol signal · Freshness: Cold · Status: Ready
Pairwise Preference · Automatic Metrics · Tool Use · Law
  • Large language model (LLM) agents are moving beyond prompting alone.
  • ChatGPT marked the rise of general-purpose LLM assistants, DeepSeek showed that on-policy reinforcement learning with verifiable rewards can improve reasoning and tool use, and OpenClaw highlights a newer direction in which agents…
Open paper
Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers

Tuhin Chakrabarty, Jane C. Ginsburg, Paramveer Dhillon · Oct 15, 2025

Citations: 0

Match reason: Matches selected tags (Law).

Score: 53% · Moderate protocol signal · Freshness: Cold · Status: Ready
Pairwise Preference · Automatic Metrics · Law
  • In blind pairwise evaluations by 28 MFA-trained readers and 516 college-educated general readers, AI text from in-context prompting was strongly disfavored by MFA readers for stylistic fidelity (OR=0.16) and quality (OR=0.13), while general…
  • Fine-tuning ChatGPT on authors' complete works reversed these results: MFA readers favored AI for fidelity (OR=8.16) and quality (OR=1.87), with general readers showing even stronger preference (fidelity OR=16.65; quality OR=5.42).
Open paper
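The odds ratios quoted above (e.g. OR=0.16 for disfavored, OR=8.16 for favored) are read relative to 1: below 1 means the AI text lost pairwise votes, above 1 means it won them. Whether the paper derives its ORs from raw counts or a regression model is not stated in this excerpt; as a hedged sketch with made-up numbers, the odds of the AI text being chosen in blind pairwise votes can be computed directly from win counts:

```python
def preference_odds(prefer_ai, prefer_human):
    """Odds of the AI text being chosen in blind pairwise votes.

    Odds > 1 mean the AI output was favored. For example, 89 votes
    for AI against 11 for human gives odds of roughly 8.1, read as
    "AI favored about 8:1". Counts here are illustrative only.
    """
    if prefer_human == 0:
        raise ZeroDivisionError("no human-preferred votes; odds undefined")
    return prefer_ai / prefer_human
```

An odds *ratio* then compares these odds across two conditions (e.g. in-context prompting vs. fine-tuning), which is how a swing from 0.16 to 8.16 should be interpreted.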
Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards

Youliang Yuan, Qiuyang Mang, Jingbang Chen, Hong Wan, Xiaoyuan Liu, Junjielong Xu · Oct 9, 2025

Citations: 0

Match reason: Matches selected tags (Law).

Score: 53% · High protocol signal · Freshness: Cold · Status: Ready
Rubric Rating · Automatic Metrics · Long Horizon · Math · Law
  • Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps-abrupt jumps to a correct output without a valid preceding derivation.
  • When integrated into an RL pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks.
Open paper
A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li, Ruifeng Xu · Sep 17, 2025

Citations: 0

Match reason: Matches selected tags (Law).

Score: 53% · High protocol signal · Freshness: Cold · Status: Ready
Red Team · Automatic Metrics · Law
  • This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses.
  • In addition, the assessment of defenses on the constructed safe prompts reveals inherent limitations of LLMs' safety mechanisms and flaws in the defense methods.
Open paper
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · Feb 18, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Red Team · Law · Multilingual
  • LLM-based agents execute real-world workflows via tools and memory.
  • We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive…
Open paper
The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage

Preni Golazizian, Elnaz Rahmati, Jackson Trager, Zhivar Sourati, Nona Ghazizadeh, Georgios Chochlakis · Feb 10, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Pairwise Preference · Rubric Rating · Law
  • By sampling annotators from police-affiliated, justice-system-impacted, and non-affiliated Los Angeles residents, we enable the systematic study of perceptual differences across diverse communities.
  • To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for…
Open paper

Match reason: Matches selected tags (Law).

Score: 53% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Automatic Metrics · Multi Agent · Law · Coding
  • We present L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search), a multi-agent retrieval framework for grounded legal question answering that decomposes queries into structured sub-problems, retrieves evidence…
  • We introduce LegalSearchQA, a 50-question benchmark across five legal domains whose answers depend on recent developments that post-date model training data.
Open paper
Citations: 0

Match reason: Matches selected tags (Law).

Score: 50% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Human Eval · Simulation Env · Law
  • Transcripts produced via automatic speech recognition (ASR) assign anonymous speaker labels (e.g., Speaker_1), preventing models from capturing consistent human behavior.
  • Turing-style human evaluations show our simulations are often indistinguishable from real deliberations, providing a practical and scalable method for complex realistic civic simulations.
Open paper
