A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks.
Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
Using the Anthropic HH-RLHF dataset, we evaluate ten diverse large language models (LLMs) under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task.
Beyond accuracy, we integrate SHAP and LIME to provide fine-grained interpretability, revealing that model decisions depend on contextualized safety and supportive framing rather than isolated keywords.
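As a rough sketch of that pairwise setup: preference scores for the chosen and rejected responses are pooled and ranked, and ROC AUC measures how often the preferred response is scored higher. The `score_response` function and the example pair below are hypothetical placeholders, not part of the paper.

```python
# Minimal sketch of pairwise preference evaluation on HH-RLHF-style data.
# `score_response` is a hypothetical stand-in for any model that assigns a
# scalar preference score to a (prompt, response) pair.
from sklearn.metrics import roc_auc_score

def score_response(prompt: str, response: str) -> float:
    # Placeholder scorer; in practice this would query an LLM or reward model.
    return float(len(response))  # trivial length heuristic, for illustration only

pairs = [
    {"prompt": "How do I stay safe online?",
     "chosen": "Use strong, unique passwords and enable two-factor authentication.",
     "rejected": "Just click whatever links look interesting."},
    # ... more (prompt, chosen, rejected) triples from the dataset
]

scores, labels = [], []
for ex in pairs:
    scores.append(score_response(ex["prompt"], ex["chosen"]))
    labels.append(1)   # preferred response
    scores.append(score_response(ex["prompt"], ex["rejected"]))
    labels.append(0)   # dispreferred response

print("ROC AUC:", roc_auc_score(labels, scores))
```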
Starting from a naïve baseline (F1=0.117 on LoCoMo), the pipeline autonomously executes ~50 experiments across two benchmarks, diagnosing failure modes, proposing architectural modifications, and repairing data pipeline bugs, all…
The resulting system achieves state-of-the-art on both benchmarks, improving F1 by +411% on LoCoMo (0.117 → 0.598) and +214% on Mem-Gallery (0.254 → 0.797) relative to the initial configurations.
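For reference, the quoted relative gains follow directly from the reported scores:

```latex
% Sanity check of the relative improvements, using the numbers in the snippet above
\frac{0.598 - 0.117}{0.117} \approx 4.11 \;(+411\%), \qquad
\frac{0.797 - 0.254}{0.254} \approx 2.14 \;(+214\%)
```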
Experimental evaluation across six mainstream generative engines demonstrates consistent improvements in citation rate (17.3%) and subjective quality (18.5%), validating the effectiveness and generalizability of the proposed…
Our framework lays a foundation for the development of more rigorous evaluation methods of MI and automated, generalizable interpretation discovery methods.
The dataset's diachronic coverage spanning two centuries makes it particularly suitable for temporal entity disambiguation and cross-domain evaluation.
Modern language models (LMs) must be trained on many orders of magnitude more words than human children receive before they begin to produce useful behavior.
Assessing the nature and origins of this "data gap" requires benchmarking LMs on human-scale datasets to understand how linguistic knowledge emerges from children's natural training data.
Under zero-shot out-of-domain settings, it improves MATH-500 savings from the 24.8% achieved by the static calibration baseline to 67.0% while maintaining a low empirical error rate, and the same trend holds across model families and downstream…
Despite rapid advances in large language models (LLMs), their linguistic abilities in low-resource and morphologically rich languages are still not well understood due to limited annotated resources and the absence of standardized…
To illustrate the framework, we create a manually annotated benchmark dataset using a low-resource Semitic language as a case study.
Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial-state matrix per recurrent layer (with zero inference overhead) outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval.
Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism.
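The snippet above describes training only a learnable initial state per recurrent layer while freezing everything else. A minimal PyTorch sketch of that idea, assuming a generic GRU stack (the paper's actual recurrent architecture is not named here, and all identifiers below are illustrative):

```python
# Illustrative sketch: train only a learnable initial hidden state per
# recurrent layer while freezing all other weights. nn.GRU is a stand-in;
# the paper's actual model is not specified in the snippet above.
import torch
import torch.nn as nn

class InitStateTunedGRU(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)
        # One trainable initial-state vector per recurrent layer.
        self.h0 = nn.Parameter(torch.zeros(num_layers, 1, d_model))

    def forward(self, tokens):
        x = self.embed(tokens)
        # Broadcast the learned initial state across the batch dimension.
        h0 = self.h0.expand(-1, tokens.size(0), -1).contiguous()
        out, _ = self.rnn(x, h0)
        return self.head(out)

model = InitStateTunedGRU()
# Freeze everything except the initial state, so only h0 receives gradients.
for name, p in model.named_parameters():
    p.requires_grad = (name == "h0")

print([n for n, p in model.named_parameters() if p.requires_grad])  # -> ['h0']
```

Since the learned state simply replaces the usual all-zeros initialization, this adds no parameters or compute at inference time, which matches the "zero inference overhead" claim above.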
Experiments across a range of reasoning benchmarks including MATH500, AIME24/25, HumanEval, and MBPP show that our approach yields better exploration-quality tradeoff than both random and low-confidence remasking.
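For context on the two baselines named above, here is a small sketch of how one refinement step might choose positions to re-mask under random versus low-confidence strategies; this illustrates the comparison strategies only, not the paper's proposed approach.

```python
# Sketch of two remasking baselines for iterative masked (diffusion-style)
# decoding: remask positions at random vs. remask the lowest-confidence ones.
import torch

def remask(token_confidences: torch.Tensor, num_to_mask: int, strategy: str) -> torch.Tensor:
    """Return a boolean mask of positions to re-mask before the next refinement step."""
    seq_len = token_confidences.numel()
    mask = torch.zeros(seq_len, dtype=torch.bool)
    if strategy == "random":
        idx = torch.randperm(seq_len)[:num_to_mask]
    elif strategy == "low_confidence":
        idx = torch.argsort(token_confidences)[:num_to_mask]  # least confident first
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    mask[idx] = True
    return mask

conf = torch.tensor([0.95, 0.42, 0.88, 0.30, 0.77])
print(remask(conf, num_to_mask=2, strategy="random"))
print(remask(conf, num_to_mask=2, strategy="low_confidence"))  # masks positions 3 and 1
```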
To establish a human perspective on these tasks, we also conducted a user study with visualization experts to assess how people perceive rhetorical techniques and the authorial intentions behind the same misleading visualizations.
This allows comparison between model and expert behavior, revealing similarities and differences that provide insights into where LLMs align with human judgment and where they diverge.
We propose a simple and scalable method for generating realistic, spatially inconsistent image pairs from multi-view scenes, enabling systematic evaluation of this capability.
Our results show that state-of-the-art MLLMs significantly underperform human observers and exhibit substantial variability across different scene attributes, revealing a fragile and incomplete understanding of 3D structure.
Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative.
In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation.