Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 434 · Search mode: keyword

Filter by tag

All · Automatic Metrics (1,735) · General (557) · Long Horizon (344) · Pairwise Preference (298) · Coding (234) · Simulation Env (201) · Multi Agent (199) · Medicine (119) · Llm As Judge (113) · Expert Verification (102) · Human Eval (92) · Rubric Rating (85) · Web Browsing (84) · Math (82) · Demonstrations (73) · Red Team (67)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection
Apr 24, 2026 · Citations: 0

Across a diverse benchmark of scaling-law tasks, our method consistently outperforms classical design-based baselines, and often approaches the performance of fitting on the full experimental set while using only about 10% of the total…

How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
Apr 24, 2026 · Citations: 0

In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks.

Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Apr 24, 2026 · Citations: 0

Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities.

Relaxation-Informed Training of Neural Network Surrogate Models
Apr 24, 2026 · Citations: 0

Experiments on non-convex benchmark functions and a two-stage stochastic programming problem with quantile neural network surrogates demonstrate that the proposed regularizers can reduce MILP solve times by up to four orders of magnitude…

An Undecidability Proof for the Plan Existence Problem
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Aligning Dense Retrievers with LLM Utility via Distillation
Apr 24, 2026 · Citations: 0

On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16% and Token F1 by 17.3% over the strong semantic baseline BGE-Base.

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Time-Localized Parametric Decomposition of Respiratory Airflow for Sub-Breath Analysis
Apr 24, 2026 · Citations: 0

Evaluation across 8,276 breaths demonstrates high reconstruction accuracy (mean squared error < 0.001 for four-component models) and robust parameter precision under moderate noise.

CRAFT: Clustered Regression for Adaptive Filtering of Training data
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Time-Localized Parametric Decomposition of Respiratory Airflow for Sub-Breath Analysis

Victoria Ribeiro Rodrigues, Paul W. Davenport, Nicholas J. Napoli · Apr 24, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 87% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · General

Evaluation across 8,276 breaths demonstrates high reconstruction accuracy (mean squared error < 0.001 for four-component models) and robust parameter precision under moderate noise.

Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek, Shiran Liu · Apr 23, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 87% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · General

Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task.
On the HATS dataset, the best LLMs achieve 92–94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics.

"This Wasn't Made for Me": Recentering User Experience and Emotional Impact in the Evaluation of ASR Bias

Siyu Liang, Alicia Beckford Wassink · Apr 22, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 87% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · Coding

Studies on bias in Automatic Speech Recognition (ASR) tend to focus on reporting error rates for speakers of underrepresented dialects, yet less research examines the human side of system bias: how do system failures shape users' lived…

Iterative Model-Learning Scheme via Gaussian Processes for Nonlinear Model Predictive Control of (Semi-)Batch Processes

Tai Xuan Tan, Alexander Mitsos, Eike Cramer · Apr 24, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 64% · Moderate protocol signal · Freshness: Hot · Status: Ready

Long Horizon · General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models

Felix Herron, Solange Rossato, Alexandre Allauzen, François Portet · Apr 24, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 64% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

When Does LLM Self-Correction Help? A Control-Theoretic Markov Diagnostic and Verify-First Intervention

Aofan Liu, Jingxiang Meng · Apr 24, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 64% · Moderate protocol signal · Freshness: Hot · Status: Ready

Math

Iterative self-correction is widely used in agentic LLM systems, but when repeated refinement helps versus hurts remains unclear.

Semantic Error Correction and Decoding for Short Block Channel Codes

Jiafu Hao, Chentao Yue, Wanchun Liu, Yonghui Li, Branka Vucetic · Apr 24, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 64% · Moderate protocol signal · Freshness: Hot · Status: Ready

Simulation Env · Coding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

CURA: Clinical Uncertainty Risk Alignment for Language Model-Based Risk Prediction

Sizhe Wang, Ziqi Xu, Claire Najjuuko, Charles Alba, Chenyang Lu · Apr 16, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% · Sparse protocol signal · Freshness: Hot · Status: Ready

Medicine

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing

Sergey V Samsonau · Apr 9, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% · Sparse protocol signal · Freshness: Hot · Status: Ready

Coding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Mixed Membership sub-Gaussian Models

Huan Qing · Apr 24, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 58% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Recognition Without Authorization: LLMs and the Moral Order of Online Advice

Tom van Nuenen · Apr 24, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 58% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

Across models, LLMs identify many of the same dynamics as human commenters, but are markedly less likely to convert that recognition into directive authorization for action.
The gap is sharpest where community consensus is strongest: on high-consensus posts involving abuse or safety threats, models recommend exit at roughly half the human rate while maintaining elevated levels of hedging, validation, and…

EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms

Brian VanVoorst, Nicholas Walczak, Christopher Gilleo, Charles Meissner, Fabio Felix, Iran Roman · Apr 23, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 58% · Sparse protocol signal · Freshness: Hot · Status: Ready

Medicine

Although this paper primarily addresses action detection as the benchmark, the EgoMAGIC dataset is equally suitable for action recognition, object identification and detection, error detection, and other challenging computer vision tasks.

Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability

Nicolae Filat, Ahmed Hussain, Konstantinos Kalogiannis, Elena Burceanu · Apr 23, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 58% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

We argue that this temporal taskification step is not a neutral preprocessing choice, but a structural component of evaluation: different valid splits of the same stream can induce different CL regimes and therefore different benchmark…
Across 9-, 30-, and 44-day splits, we observe substantial changes in forecasting error, forgetting, and backward transfer, showing that taskification alone can materially affect CL evaluation.

Error-free Training for MedMNIST Datasets

Bo Deng · Apr 20, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 58% · Sparse protocol signal · Freshness: Hot · Status: Ready

Medicine

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

CognitiveTwin: Robust Multi-Modal Digital Twins for Predicting Cognitive Decline in Alzheimer's Disease

Bulent Soykan, Gulsah Hancerliogullari Koksalmis, Hsin-Hsiung Huang, Laura J. Brattain · Apr 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · Medicine

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

Pengfeng Li, Chen Huang, Chaoqun Hao, Hongyao Chen, Xiao-Yong Wei, Wenqiang Lei · Apr 13, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · Coding

Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
To address this, we pioneer METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting.

Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

Meghana Karnam, Ananya Joshi · Apr 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Fallback

Llm As Judge · Long Horizon · General

Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks like assessing self-harm risk and screening for depression.
We present a statistical framework for multi-agent pipelines structured as directed acyclic graphs (DAGs) that provides an alternative to heuristic voting with principled, adaptive decision-making.

FlowForge: A Staged Local Rollout Engine for Flow-Field Prediction

Xiaowen Zhang, Ziming Zhou, Fengnian Zhao, David L. S. Hung · Apr 21, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% · High protocol signal · Freshness: Hot · Status: Fallback

Automatic Metrics · Long Horizon · General

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng · Apr 13, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Fallback

Simulation Env · Multi Agent · General

We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than…

Latent-Condensed Transformer for Efficient Long Context Modeling

Zeng You, Yaofo Chen, Qiuwu Chen, Ying Sun, Shuhai Zhang, Yingjian Li · Apr 14, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
