Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 501 · Search mode: keyword

Filter by tag

All · Automatic Metrics (1,975) · General (610) · Long Horizon (391) · Pairwise Preference (328) · Coding (259) · Simulation Env (226) · Multi Agent (213) · Medicine (129) · Llm As Judge (121) · Expert Verification (110) · Human Eval (100) · Math (98) · Rubric Rating (94) · Web Browsing (91) · Demonstrations (80) · Tool Use (77)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Tokenisation via Convex Relaxations
May 21, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Vector Policy Optimization: Training for Diversity Improves Test-Time Search
May 21, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Evaluating Commercial AI Chatbots as News Intermediaries
May 21, 2026 · Citations: 0

We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5 and GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional…

Reducing Political Manipulation with Consistency Training
May 21, 2026 · Citations: 0

We show that PCT preserves overall helpfulness, substantially reduces covert political bias, and generalizes to held-out benchmarks.

Understanding Data Temporality Impact on Large Language Models Pre-training
May 21, 2026 · Citations: 0

First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods.

ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning
May 21, 2026 · Citations: 0

The graph is constructed through a disease-autonomous multi-agent pipeline in which multiple frontier LLMs independently extract knowledge from PubMed and PMC literature.

Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models
May 21, 2026 · Citations: 0

We investigate whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline.

AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild
May 21, 2026 · Citations: 0

As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild.

AMEL: Accumulated Message Effects on LLM Judgments
May 21, 2026 · Citations: 0

Across 75,898 API calls to 11 models from 4 providers (OpenAI, Anthropic, Google, and four open-source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative…

Tokenization with Split Trees
May 21, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Self-Policy Distillation via Capability-Selective Subspace Projection
May 21, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations Corpora
May 21, 2026 · Citations: 0

Using ~50k morally-annotated social media posts from a diverse range of topics, we apply a principled four-method validation pipeline: LaBSE cross-lingual embedding similarity, Centered Kernel Alignment (CKA), LLM-as-judge evaluation,…

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scene Abstraction for Lexical Semantics: Structured Representations of Situated Meaning

Yejin Cho, Katrin Erk · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Ready

Pairwise Preference · Automatic Metrics · General

Our contributions are three-fold: (1) a structured representation framework for situated lexical meaning; (2) COCA-Scenes, a dataset of 520 usage instances across 26 keywords for distinct scene identification; and (3) empirical evidence…

Cohesion-6K: An Arabic Dataset for Analyzing Social Cohesion and Conflict in Online Discourse

Aisha Ali Al-Athba, Wajdi Zaghouani · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · Coding

The annotation process combines expert human judgment with model-assisted pre-labeling verified by trained annotators, achieving substantial inter-annotator agreement (Cohen's kappa = 0.85).
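
For readers comparing agreement figures like the one quoted above, the following is a minimal sketch of how Cohen's kappa is computed for two annotators. The function name and the label lists are hypothetical and invented for illustration; they are not taken from the Cohesion-6K dataset or the paper's code.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels, invented for illustration only.
a = ["cohesion", "conflict", "cohesion", "neutral", "conflict", "cohesion"]
b = ["cohesion", "conflict", "cohesion", "neutral", "cohesion", "cohesion"]
kappa = cohens_kappa(a, b)
print(round(kappa, 3))
```

In this toy case the raters agree on 5 of 6 items, but kappa is lower than the raw agreement rate because part of that agreement is expected by chance.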

DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

Jianing Yin, Tan Tang · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · General

Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content.

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 42% · Moderate protocol signal · Freshness: Hot · Status: Ready

Simulation Env · General

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world…
Finally, we show that finetuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of…

BeLink: Biomedical Entity Linking Meets Generative Re-Ranking

Darya Shlyk, Stefano Montanelli, Lawrence Hunter · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 42% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · Medicine

Our method demonstrates strong performance on multiple BEL benchmarks, yielding significant improvements in linking accuracy (3%-24%) while reducing inference time compared to the state-of-the-art.

Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity

Hangyue Zhao, Paul Caillon, Erwan Fagnou, Alexandre Allauzen · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 42% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · General

Recent task-specific attention operators can compress deep Transformer stacks into a few layers by performing multi-hop state propagation within a single layer, but their dense evaluation remains expensive.
On controlled tracking benchmarks, our method matches the dense operator's accuracy while reducing wall-clock time by 12-29% under a standardized measurement protocol, and is up to 2.4× faster than a compact dense Transformer at…

From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models

Caleb Munigety · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 42% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · General

Two NLA-inspired evaluations strengthen this picture: the fifteen selective features explain only 31% of activation variance versus the SAE's 99.7%, and selectivity ratio anticorrelates with causal force (r = -0.56).
A cost-based deployment evaluation (assumed 50/FN, 0.42/FP, 2% error rate) finds an optimal monitor configuration yielding 8.96 per 1000 queries against a 1000 baseline, a 99.1% saving.
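
The figures in that cost evaluation can be sanity-checked with a few lines of arithmetic. The sketch below assumes that the baseline corresponds to running without a monitor, so that every error incurs the false-negative cost; that reading is an assumption, not something the abstract states, and the cost units are not given there either. The false-positive cost is not needed for this check.

```python
# Figures quoted in the abstract; the cost units are not stated there.
COST_PER_FN = 50.0
ERROR_RATE = 0.02
QUERIES = 1000
MONITORED_COST = 8.96  # reported cost per 1000 queries with the optimal monitor

# Assumption: baseline = no monitor, so every error is a missed error (FN).
baseline_cost = QUERIES * ERROR_RATE * COST_PER_FN
saving = 1 - MONITORED_COST / baseline_cost

print(f"baseline: {baseline_cost:.2f}, saving: {saving:.1%}")
```

Under that assumption the baseline works out to 1000 per 1000 queries, which matches the quoted baseline, and the saving rounds to the quoted 99.1%.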

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

Yuchun Fan, Bei Li, Peiguang Li, Yilin Wang, Yongyu Mu, Jian Yang · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 38% · Sparse protocol signal · Freshness: Hot · Status: Ready

Long Horizon · Math · Multilingual

Empirical results on challenging multilingual mathematical benchmarks reveal that LANG substantially enhances reasoning performance without compromising language consistency.

Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Xuxin Zhang, Huangyu Dai · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 38% · Sparse protocol signal · Freshness: Hot · Status: Ready

Long Horizon · Coding

Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent.
We take a step back and ask whether any of this machinery is actually necessary, and propose Search-E1, a self-evolution method that lets a search-augmented agent improve through only vanilla GRPO interleaved with offline self-distillation…

In Silico Modeling of the RAMPHO Buffer: Dissociating Informational and Energetic Masking via Phonetic Entropy in Deep Neural Networks

Stefan Bleeck · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 38% · Sparse protocol signal · Freshness: Hot · Status: Ready

Simulation Env · General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

A Tutorial on Diffusion Theory: From Differential Equations to Diffusion Models

Jiayi Fu, Yuxia Wang · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

Math

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion

Meimingwei Li, Yuanhao Ding, Esteban Garces Arias, Christian Heumann · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

Coding

We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories.
We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes.

One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

Yevhen Kostiuk, Kenneth Enevoldsen · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

Single-point evaluation ignores a central problem of the instruction-based approach, namely its sensitivity to the phrasing of the instruction.
Our findings suggest that single-prompt evaluation is insufficient for instruction-tuned embedding models and that benchmarks should incorporate prompt robustness, either by evaluating over multiple prompts or by reporting sensitivity…

Reflecti-Mate: A Conversational Agent for Adaptive Decision-Making Support Through System 1 and System 2 Thinking

Morita Tarvirdians, Senthil Chandrasegaran, Hayley Hung, Catholijn M. Jonker, Catharine Oertel · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

In this study, we investigate an agent designed to encourage integration by adapting to the individual user's thought patterns.
We explore its effects on participants' perceptions of the agent and their reflective behavior, in comparison with unaided pre-reflection and a baseline agent.

Polite on the Surface, Wrong in Practice: A Curated Dataset for Fixing Honorific Failures in Multilingual Bangla Generation

Md. Asaduzzaman Shuvo, Mahedi Hasan, Md. Tashin Parvez, Azizul Haque Noman, Md. Shafayet Hossain Ovi · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

Coding · Multilingual

To address this limitation, we introduce BLADE (BangLa Application and DialoguE generation), a novel, culturally aligned instruction-tuning dataset and benchmarking framework comprising 4,196 meticulously curated interaction pairs.
Our empirical evaluations demonstrate that models fine-tuned on our dataset yield substantial improvements in structural fidelity and honorific alignment, providing a rigorous benchmark for bridging pragmatic disparities in low-resource…

Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation

Genoveffa Martone, Helena Bonaldi, Marco Guerini · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

23 experts revise the generated counterspeech (CS), which are assessed via human and automatic metrics.
Based on the post-edited CS, the mixed strategy proves to be the most effective in crowdsourcing evaluation, pairing strong factual correction with stereotype mitigation and empathetic engagement.

Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings

Jakub Radzikowski, Josef Chen · May 21, 2026

Citations: 0

Match reason: Ranked by recency.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

Multilingual