Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 46 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,735) General (557) Long Horizon (344) Pairwise Preference (298) Coding (234) Simulation Env (201) Multi Agent (199) Medicine (119) Llm As Judge (113) Expert Verification (102) Human Eval (92) Rubric Rating (85) Web Browsing (84) Math (82) Demonstrations (73) Red Team (67)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection
Apr 24, 2026 · Citations: 0

Across a diverse benchmark of scaling-law tasks, our method consistently outperforms classical design-based baselines, and often approaches the performance of fitting on the full experimental set while using only about 10% of the total…
How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
Apr 24, 2026 · Citations: 0

In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks.
Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Apr 24, 2026 · Citations: 0

Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities.
Relaxation-Informed Training of Neural Network Surrogate Models
Apr 24, 2026 · Citations: 0

Experiments on non-convex benchmark functions and a two-stage stochastic programming problem with quantile neural network surrogates demonstrate that the proposed regularizers can reduce MILP solve times by up to four orders of magnitude…
An Undecidability Proof for the Plan Existence Problem
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Aligning Dense Retrievers with LLM Utility via DistillationAligning Dense Retrievers with LLM Utility via Distillation
Apr 24, 2026 · Citations: 0

On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16% and Token F1 by 17.3% over the strong semantic baseline BGE-Base.
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Time-Localized Parametric Decomposition of Respiratory Airflow for Sub-Breath Analysis
Apr 24, 2026 · Citations: 0

Evaluation across 8,276 breaths demonstrates high reconstruction accuracy (mean squared error < 0.001 for four-component models) and robust parameter precision under moderate noise.
CRAFT: Clustered Regression for Adaptive Filtering of Training data
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li, Xander Xu · Feb 15, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 58% High protocol signal Freshness: Warm Status: Ready

Expert VerificationCritique Edit Automatic Metrics Law

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons.

Open paper

APEX-Agents

Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein · Jan 20, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 58% High protocol signal Freshness: Warm Status: Ready

Rubric RatingExpert Verification Automatic Metrics Long Horizon Law

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate…
We test eight agents for the leaderboard using Pass@1.

Open paper

The Trinity of Consistency as a Defining Principle for General World Models

Jingxuan Wei, Siyuan Li, Yuhang Xu, Zheng Sun, Junjie Jiang, Hexuan Jin · Feb 26, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback

Simulation Env Long Horizon Law

To complement this conceptual framework, we introduce CoW-Bench, a benchmark centered on multi-frame reasoning and generation scenarios.
CoW-Bench evaluates both video generation models and UMMs under a unified evaluation protocol.

Open paper

Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation

Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback

Automatic Metrics Multi Agent Law

We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
The pipeline intercepts Whisper's initial transcript, applies specialized LLM agents for domain context identification, named entity recognition, and jargon detection, and generates compact prompts that guide Whisper's decoder.

Open paper

Multimodal Multi-Agent Empowered Legal Judgment Prediction

Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu, Rong Fu · Jan 19, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback

Simulation Env Multi Agent Law

Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.
Experiments on JurisMM and the benchmark LawBench validate our framework's effectiveness.

Open paper

Orthogonalized Policy Optimization:Policy Optimization as Orthogonal Projection in Hilbert Space

Wang Zixian · Jan 18, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback

Automatic Metrics Long Horizon MathLaw

Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.

Open paper

Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System

Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback

Human EvalAutomatic Metrics Law

Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
Human evaluation of the generated explanations across Clarity, Linking, and Usefulness metrics highlights GPT-4o mini's superior interpretability.

Open paper

CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts

Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik · Jan 8, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback

Llm As Judge Multi Agent LawCoding

To address this, we present CircuitLM, a multi-agent pipeline that translates user prompts into structured, visually interpretable CircuitJSON schematics.
To systematically assess performance, we deploy a rigorous dual-layered evaluation methodology: a deterministic Electrical Rule Checking (ERC) engine categorizes topological faults by strict severity (Critical, Major, Minor, Warning), while…

Open paper

Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills

Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu · Dec 18, 2025

Citations: 0

Match reason: Matches selected tags (Law).

Score: 53% High protocol signal Freshness: Cold Status: Ready

Pairwise Preference Automatic Metrics Tool Use Law

Large language model (LLM) agents are moving beyond prompting alone.
ChatGPT marked the rise of general-purpose LLM assistants, DeepSeek showed that on-policy reinforcement learning with verifiable rewards can improve reasoning and tool use, and OpenClaw highlights a newer direction in which agents…

Open paper

Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers

Tuhin Chakrabarty, Jane C. Ginsburg, Paramveer Dhillon · Oct 15, 2025

Citations: 0

Match reason: Matches selected tags (Law).

Score: 53% Moderate protocol signal Freshness: Cold Status: Ready

Pairwise Preference Automatic Metrics Law

In blind pairwise evaluations by 28 MFA-trained readers and 516 college-educated general readers, AI text from in-context prompting was strongly disfavored by MFA readers for stylistic fidelity (OR=0.16) and quality (OR=0.13), while general…
Fine-tuning ChatGPT on authors' complete works reversed these results: MFA readers favored AI for fidelity (OR=8.16) and quality (OR=1.87), with general readers showing even stronger preference (fidelity OR=16.65; quality OR=5.42).

Open paper

Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards

Youliang Yuan, Qiuyang Mang, Jingbang Chen, Hong Wan, Xiaoyuan Liu, Junjielong Xu · Oct 9, 2025

Citations: 0

Match reason: Matches selected tags (Law).

Score: 53% High protocol signal Freshness: Cold Status: Ready

Rubric Rating Automatic Metrics Long Horizon MathLaw

Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps-abrupt jumps to a correct output without a valid preceding derivation.
When integrated into an RL pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks.

Open paper

A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li, Ruifeng Xu · Sep 17, 2025

Citations: 0

Match reason: Matches selected tags (Law).

Score: 53% High protocol signal Freshness: Cold Status: Ready

Red Team Automatic Metrics Law

This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses.
In addition, the assessment of defenses on the constructed safe prompts reveals inherent limitations of LLMs' safety mechanisms and flaws in the defense methods.

Open paper

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · Feb 18, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Red Team LawMultilingual

LLM-based agents execute real-world workflows via tools and memory.
We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive…

Open paper

Learning Page Order in Shuffled WOO Releases

Efe Kahraman, Giulio Tosato · Feb 11, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Pairwise Preference Law

Open paper

The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage

Preni Golazizian, Elnaz Rahmati, Jackson Trager, Zhivar Sourati, Nona Ghazizadeh, Georgios Chochlakis · Feb 10, 2026

Citations: 0

Match reason: Matches selected tags (Law).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Pairwise PreferenceRubric Rating Law

By sampling annotators from police-affiliated, justice-system-impacted, and non-affiliated Los Angeles residents, we enable the systematic study of perceptual differences across diverse communities.
To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for…

Open paper

Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors

Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025

Citations: 0

Match reason: Matches selected tags (Law).

Score: 53% Moderate protocol signal Freshness: Cold Status: Fallback

Automatic Metrics Long Horizon Law

We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.

Open paper

Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

Ziliang Wang, Kang An, Xuhui Zheng, Faqiang Qian, Weikun Zhang, Cijun Ouyang · Oct 1, 2025

Citations: 0

Match reason: Matches selected tags (Law).

Score: 53% High protocol signal Freshness: Cold Status: Fallback

Automatic Metrics Long Horizon Law

Open paper

L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search

Ziqi Wang, Boqin Yuan · Aug 31, 2025

Citations: 0

Match reason: Matches selected tags (Law).

Score: 53% Moderate protocol signal Freshness: Cold Status: Fallback

Automatic Metrics Multi Agent LawCoding

We present L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search), a multi-agent retrieval framework for grounded legal question answering that decomposes queries into structured sub-problems, retrieves evidence…
We introduce LegalSearchQA, a 50-question benchmark across five legal domains whose answers depend on recent developments that post-date model training data.

Open paper

Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation

Scott Merrill, Shashank Srivastava · Nov 21, 2025

Citations: 0

Match reason: Matches selected tags (Law).

Score: 50% Moderate protocol signal Freshness: Cold Status: Fallback

Human EvalSimulation Env Law

Transcripts produced via automatic speech recognition (ASR) assign anonymous speaker labels (e.g., Speaker_1), preventing models from capturing consistent human behavior.
Turing-style human evaluations show our simulations are often indistinguishable from real deliberations, providing a practical and scalable method for complex realistic civic simulations.

Open paper

Sub-exponential Growth Dynamics in Complex Systems: A Piecewise Power-Law Model for the Diffusion of New Words and Names

Hayafumi Watanabe · Nov 6, 2025

Citations: 0

Match reason: Matches selected tags (Law).

Score: 46% Sparse protocol signal Freshness: Cold Status: Fallback

Pairwise Preference Law

inward (community) contact suggests that α can be interpreted as an index of the preference for outward-oriented communication.

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent