Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 736 · Search mode: keyword

Filter by tag

All · Automatic Metrics (1,620) · General (530) · Long Horizon (320) · Pairwise Preference (288) · Coding (218) · Simulation Env (187) · Multi Agent (182) · Medicine (116) · Llm As Judge (107) · Expert Verification (97) · Human Eval (89) · Rubric Rating (82) · Web Browsing (79) · Math (77) · Demonstrations (67) · Critique Edit (63)

Browse by Topic

Jump directly to tag and hub pages to explore related clusters of papers.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages rather than relying on generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

SODIUM: From Open Web Data to Queryable Databases

Chuxuan Hu, Philip Li, Maxwell Yang, Daniel Kang · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% · High protocol signal · Freshness: Hot · Status: Ready

Expert Verification · Automatic Metrics · Multi Agent · General

Existing systems struggle with SODIUM tasks: we evaluate 6 advanced AI agents on SODIUM-Bench, with the strongest baseline achieving only 46.5% accuracy.
To bridge this gap, we develop SODIUM-Agent, a multi-agent system composed of a web explorer and a cache manager.

JUBAKU: An Adversarial Benchmark for Exposing Culturally Grounded Stereotypes in Japanese LLMs

Taihei Shiotani, Masahiro Kaneko, Ayana Niwa, Yuki Maruyama, Daisuke Oba, Masanari Ohi · Mar 21, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · Multilingual

To address this limitation, we introduce Japanese cUlture adversarial BiAs benchmarK Under handcrafted creation (JUBAKU), a benchmark tailored to Japanese cultural contexts.
All models clearly exhibited biases on JUBAKU, performing below the random baseline of 50% with an average accuracy of 23% (ranging from 13% to 33%), despite higher accuracy on the other benchmarks.

Improving moment tensor solutions under Earth structure uncertainty with simulation-based inference

A. A. Saoulis, T.-S. Pham, A. M. G. Ferreira · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% · Sparse protocol signal · Freshness: Hot · Status: Ready

Simulation Env · General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Agentic Business Process Management: A Research Manifesto

Diego Calvanese, Angelo Casciani, Giuseppe De Giacomo, Marlon Dumas, Fabiana Fournier, Timotheus Kampik · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% · Sparse protocol signal · Freshness: Hot · Status: Ready

Multi Agent · General

This paper presents a manifesto that articulates the conceptual foundations of Agentic Business Process Management (APM), an extension of Business Process Management (BPM) for governing autonomous agents executing processes in…
We introduce the core abstractions and architectural elements required to realize APM systems and elaborate on four key capabilities that such APM agents must support: framed autonomy, explainability, conversational actionability, and…

Motion-o: Trajectory-Grounded Video Reasoning

Bishoy Galoaa, Shayda Moezzi, Xiangyu Bai, Sarah Ostadabbas · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% · Sparse protocol signal · Freshness: Hot · Status: Ready

Long Horizon · Coding

At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning.

Inducing Sustained Creativity and Diversity in Large Language Models

Queenie Luo, Gary King, Michael Puett, Michael D. Smith · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis

Sheng Lu, Hao Chen, Rui Yin, Juyan Ba, Yu Zhang, Yuanzhe Li · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% · Sparse protocol signal · Freshness: Hot · Status: Ready

Medicine

To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis providing 1.7K cases.
Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: Can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports?

Parallelograms Strike Back: LLMs Generate Better Analogies than People

Qiawen Ella Liu, Raja Marjieh, Jian-Qiao Zhu, Adele E. Goldberg, Thomas L. Griffiths · Mar 19, 2026

Citations: 0

Match reason: Title directly matches "elo".

Score: 80% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

Four-term word analogies (A:B::C:D) are classically modeled geometrically as "parallelograms," yet recent work suggests this model poorly captures how humans produce analogies, with simple local-similarity heuristics often providing a…
Crucially, we show that the improvement over human analogies was driven by greater parallelogram alignment and reduced reliance on accessible words rather than enhanced sensitivity to local similarity.

Man and machine: artificial intelligence and judicial decision making

Arthur Dyevre, Ahmad Shahvaroughi · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% · Sparse protocol signal · Freshness: Hot · Status: Ready

Law

At the same time, these developments have brought the limitations of human judgment into sharper relief and underscored the importance of understanding how judges interact with AI-based decision aids.
Using criminal justice risk assessment as a focal case, we conduct a synthetic review connecting three intertwined aspects of AI's role in judicial decision-making: the performance and fairness of AI tools, the strengths and biases of human…

Foundations of Schrödinger Bridges for Generative Modeling

Sophia Tang · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% · Sparse protocol signal · Freshness: Hot · Status: Ready

Math

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Large-Scale Analysis of Political Propaganda on Moltbook

Julia Jose, Meghna Manoj Nair, Rachel Greenstadt · Mar 18, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

We present an NLP-based study of political propaganda on Moltbook, a Reddit-style platform for AI agents.
4% of agents produced 51% of these posts.

ReViSQL: Achieving Human-Level Text-to-SQL

Yuxuan Zhu, Tengjun Jin, Yoojin Choi, Daniel Kang · Mar 20, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · General

However, despite these extensive architectural engineering efforts, a significant gap remains: even state-of-the-art (SOTA) AI agents have not yet achieved the human-level accuracy on the BIRD benchmark.
We introduce ReViSQL, a streamlined framework that achieves human-level accuracy on BIRD for the first time.

I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems

Vedanta S P, Ponnurangam Kumaraguru · Mar 19, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Ready

Rubric Rating · Simulation Env · Multi Agent · General

Large language models are increasingly proposed as autonomous agents for high-stakes public workflows, yet we lack systematic evidence about whether they would follow institutional rules when granted authority.
We evaluate multi-agent governance simulations in which agents occupy formal governmental roles under different authority structures, and we score rule-breaking and abuse outcomes with an independent rubric-based judge across 28,112…

Mi:dm K 2.5 Pro

KT Tech innovation Group · Mar 19, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Fallback

Automatic Metrics · Long Horizon · Math · Coding

The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi-step reasoning, long-context understanding, and agentic workflows.
The evaluations show that Mi:dm K 2.5 Pro achieves competitive performance against leading global and domestic models.

Box Maze: A Process-Control Architecture for Reliable LLM Reasoning

Zou Qiang · Mar 19, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% · Sparse protocol signal · Freshness: Hot · Status: Ready

Simulation Env · General

Existing safety approaches -- such as reinforcement learning from human feedback (RLHF) and output filtering -- primarily operate at the behavioral level and may lack explicit architectural mechanisms for enforcing reasoning process…
We introduce preliminary simulation-based evaluation involving progressive boundary erosion scenarios across multiple heterogeneous LLM systems (DeepSeek-V3, Doubao, Qwen).

Conflict-Based Search for Multi Agent Path Finding with Asynchronous Actions

Xuemian Wu, Shizhe Zhao, Zhongqiang Ren · Mar 19, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% · Sparse protocol signal · Freshness: Hot · Status: Ready

Multi Agent · General

Multi-Agent Path Finding (MAPF) seeks collision-free paths for multiple agents from their respective start locations to their respective goal locations while minimizing path costs.
Most existing MAPF algorithms rely on a common assumption of synchronized actions, where the actions of all agents start at the same time and always take a time unit, which may limit the use of MAPF planners in practice.

Online Learning and Equilibrium Computation with Ranking Feedback

Mingyang Liu, Yongshan Chen, Zhiyuan Fan, Gabriele Farina, Asuman Ozdaglar, Kaiqing Zhang · Mar 19, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

Most existing online learning algorithms rely on numeric utility feedback from the environment, which may be unavailable in human-in-the-loop applications and/or may be restricted by privacy concerns.

Security, privacy, and agentic AI in a regulatory view: From definitions and distinctions to provisions and reflections

Shiliang Zhang, Sabita Maharjan · Mar 19, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

Law

As AI paradigms shift towards greater autonomy, specifically in the form of agentic AI, it becomes increasingly challenging to precisely articulate regulatory stipulations.
This challenge is even more acute in the domains of security and privacy, where the capabilities of autonomous agents often blur traditional legal and technical boundaries.

Automatic detection of Gen-AI texts: A comparative framework of neural models

Cristian Buttaro, Irene Amerini · Mar 19, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

Multilingual

The rapid proliferation of Large Language Models has significantly increased the difficulty of distinguishing between human-written and AI generated texts, raising critical issues across academic, editorial, and social domains.
This paper investigates the problem of AI generated text detection through the design, implementation, and comparative evaluation of multiple machine learning based detectors.

From Noise to Signal: When Outliers Seed New Topics

Evangelia Zve, Gauvain Bourgne, Benjamin Icard, Jean-Gabriel Ganascia · Mar 18, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives
