Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 176 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,735) General (557) Long Horizon (344) Pairwise Preference (298) Coding (234) Simulation Env (201) Multi Agent (199) Medicine (119) Llm As Judge (113) Expert Verification (102) Human Eval (92) Rubric Rating (85) Web Browsing (84) Math (82) Demonstrations (73) Red Team (67)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection
Apr 24, 2026 · Citations: 0

Across a diverse benchmark of scaling-law tasks, our method consistently outperforms classical design-based baselines, and often approaches the performance of fitting on the full experimental set while using only about 10% of the total…
How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
Apr 24, 2026 · Citations: 0

In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks.
Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Apr 24, 2026 · Citations: 0

Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities.
Relaxation-Informed Training of Neural Network Surrogate Models
Apr 24, 2026 · Citations: 0

Experiments on non-convex benchmark functions and a two-stage stochastic programming problem with quantile neural network surrogates demonstrate that the proposed regularizers can reduce MILP solve times by up to four orders of magnitude…
An Undecidability Proof for the Plan Existence Problem
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Aligning Dense Retrievers with LLM Utility via DistillationAligning Dense Retrievers with LLM Utility via Distillation
Apr 24, 2026 · Citations: 0

On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16% and Token F1 by 17.3% over the strong semantic baseline BGE-Base.
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Time-Localized Parametric Decomposition of Respiratory Airflow for Sub-Breath Analysis
Apr 24, 2026 · Citations: 0

Evaluation across 8,276 breaths demonstrates high reconstruction accuracy (mean squared error < 0.001 for four-component models) and robust parameter precision under moderate noise.
CRAFT: Clustered Regression for Adaptive Filtering of Training data
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks

Fahmida Alam, Ellen Riloff · Apr 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% Moderate protocol signal Freshness: Hot Status: Ready

Expert Verification Automatic Metrics MedicineCoding

Open paper

How Much LLM Does a Self-Revising Agent Actually Need?

Sungwoo Jung, Seonil Son · Apr 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% Moderate protocol signal Freshness: Hot Status: Ready

Critique Edit Automatic Metrics General

Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop.
We introduce a declared reflective runtime protocol that externalizes agent state, confidence signals, guarded actions, and hypothetical transitions into inspectable runtime structure.

Open paper

Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering

Elyas Irankhah, Samah Fodeh · Apr 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% Moderate protocol signal Freshness: Hot Status: Ready

Expert Verification Automatic Metrics Medicine

Open paper

SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA)

Liang-Chih Yu, Jonas Becker, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Lung-Hao Lee, Ying-Lung Lin · Apr 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan, Elliot M. Fielstein · Apr 7, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% High protocol signal Freshness: Hot Status: Ready

Expert Verification Automatic Metrics MedicineMultilingual

Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.
Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria.

Open paper

Aligning Dense Retrievers with LLM Utility via DistillationAligning Dense Retrievers with LLM Utility via Distillation

Rajinder Sandhu, Di Mu, Cheng Chang, Md Shahriar Tasjid, Himanshu Rai, Maksims Volkovs · Apr 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16% and Token F1 by 17.3% over the strong semantic baseline BGE-Base.

Open paper

ReLeVAnT: Relevance Lexical Vectors for Accurate Legal Text Classification

Ishaan Gakhar, Harsh Nandwani · Apr 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics Law

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Machine learning and emoji prediction: How much accuracy can MARBERT achieve?

Mohammed Q. Shormani, Ibrahim Abdulmalik Hassan Muneef Y. Alshawsh · Apr 22, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

A GAN and LLM-Driven Data Augmentation Framework for Dynamic Linguistic Pattern Modeling in Chinese Sarcasm Detection

Wenxian Wang, Xiaohu Luo, Junfeng Hao, Xiaoming Gu, Xingshu Chen, Zhu Wang · Apr 9, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Clickbait detection: quick inference with maximum impact

Soveatin Kuntur, Panggih Kusuma Ningrum, Anna Wróblewska, Maria Ganzha, Marcin Paprzycki · Apr 9, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Graph Neural Networks for Misinformation Detection: Performance-Efficiency Trade-offs

Soveatin Kuntur, Maciej Krzywda, Anna Wróblewska, Marcin Paprzycki, Maria Ganzha, Szymon Łukasik · Apr 9, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

In this work, we benchmark graph neural networks (GNNs) against non-graph-based machine learning methods under controlled and comparable conditions.

Open paper

Are Non-English Papers Reviewed Fairly? Language-of-Study Bias in NLP Peer Reviews

Ehsan Barkhordar, Abdulfattah Safa, Verena Blaschke, Erika Lombart, Marie-Catherine de Marneffe, Gözde Gül Şahin · Apr 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics Multilingual

We present the first systematic characterization of LoS bias, distinguishing negative and positive forms, and introduce the human-annotated dataset LOBSTER (Language-Of-study Bias in ScienTific pEer Review) and a method achieving 87.37…

Open paper

IndoBERT-Sentiment: Context-Conditioned Sentiment Classification for Indonesian Text

Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja · Apr 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

In a head-to-head evaluation against three widely used general-purpose Indonesian sentiment models on the same test set, IndoBERT-Sentiment outperforms the best baseline by 35.6 F1 points.

Open paper

Environmental, Social and Governance Sentiment Analysis on Slovene News: A Novel Dataset and Models

Paula Dodig, Boshko Koloski, Katarina Sitar Šuštar, Senja Pollak, Matthew Purver · Apr 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics Multilingual

The dataset, derived from the MaCoCu Slovene news collection, combines large language model (LLM)-assisted filtering with human annotation of company-related ESG content.

Open paper

LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources

Joshua Castillo, Ravi Mukkamala · Apr 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles.

Open paper

FMI@SU ToxHabits: Evaluating LLMs Performance on Toxic Habit Extraction in Spanish Clinical Texts

Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva · Apr 7, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics Medicine

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

"I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?

Naen Xu, Jiayi Sheng, Changjiang Li, Chunyi Zhou, Yuyuan Li, Tianyu Du · Apr 7, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks.
Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors.

Open paper

Just Pass Twice: Efficient Token Classification with LLMs for Zero-Shot NER

Ahmed Ewais, Ahmed Hashish, Amr Ali · Apr 6, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Our approach achieves state-of-the-art results on zero-shot NER benchmarks, surpassing the previous best method by +7.9 F1 on average across CrossNER and MIT benchmarks, being over 20x faster than comparable generative methods.

Open paper

StoryScope: Investigating idiosyncrasies in AI fiction

Jenna Russell, Rishanth Rajendhran, Mohit Iyyer, John Wieting · Apr 3, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

While most existing work in this space focuses on identifying surface-level signatures of AI writing, we ask instead whether AI-generated stories can be distinguished from human ones without relying on stylistic signals, focusing on…
Narrative features alone achieve 93.2% macro-F1 for human vs.

Open paper

Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit · Apr 7, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% Moderate protocol signal Freshness: Hot Status: Fallback

Llm As JudgeAutomatic Metrics General

We introduce a weak supervision framework that combines three complementary grounding signals: substring matching, sentence embedding similarity, and an LLM as a judge verdict to label generated responses as grounded or hallucinated without…
Transformer-based probes achieve the strongest discrimination, with M2 performing best on 5-fold average AUC/F1, and M3 performing best on both single-fold validation and held-out test evaluation.

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent