A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Across three 7-8B instruction-tuned models and two benchmarks, phi_first achieves a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency.
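Since the comparison above hinges on AUROC, a minimal reminder of what that metric measures may help with triage: it is the probability that a randomly chosen positive item is scored above a randomly chosen negative one. The sketch below is a generic illustration, not the paper's evaluation code:

```python
def auroc(scores, labels):
    """AUROC as a pairwise ranking probability: the chance that a
    random positive item outscores a random negative item
    (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A signal that ranks every correct answer above every incorrect one
# achieves the maximum AUROC of 1.0; chance level is 0.5.
print(auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

An AUROC of 0.820 versus 0.793 therefore means the proposed signal ranks correct above incorrect responses in a larger fraction of positive/negative pairs than the baselines do.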
Benchmarks on three real-world datasets (Dreaddit, SDCNL, and DepSeverity) yield accuracies of 0.835, 0.731, and 0.751, respectively, demonstrating its potential for reliable mental health prediction.
Concept Fields provide a fast, lightweight, and interpretable signal for groundedness and novelty, complementary to LLM-as-judge and white-box detectors.
To test this hypothesis at the item level, we introduce the Pinocchio score (π_i), the ratio of inter-model response variance under neutral prompting to that under a human-simulation prompt, as an annotation-free measure of each item's…
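As defined above, the Pinocchio score is a ratio of two inter-model variances. A minimal sketch of that computation follows; the function name and the use of scalar per-model response scores are illustrative assumptions, not the paper's implementation:

```python
import statistics

def pinocchio_score(neutral_scores, humansim_scores):
    """Hypothetical sketch of the Pinocchio score pi_i for one item:
    the ratio of inter-model variance of responses under neutral
    prompting to the variance under a human-simulation prompt.
    Each list holds one scalar response score per model."""
    var_neutral = statistics.pvariance(neutral_scores)
    var_humansim = statistics.pvariance(humansim_scores)
    return var_neutral / var_humansim

# Models disagree widely under neutral prompting but converge under
# the human-simulation prompt, so pi_i comes out well above 1.
print(pinocchio_score([0.1, 0.9, 0.5], [0.45, 0.55, 0.5]))
```

A large score thus flags items where the human-simulation framing collapses otherwise divergent model behavior.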
Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong thinking models make almost no actual tool calls.
When applied to Qwen3 thinking models at 4B and 30B scales, our recipe yields models that achieve state-of-the-art performance in a wide range of benchmarks among open-source models, such as 96.7% and 99.2% on AIME 2025 for 4B and 30B,…
In contrast, fully unstructured teams enable adaptability and exploration but suffer from inefficiencies such as error propagation, inter-agent conflicts, and wasted resources (measured in time, tokens, or file operations).
We introduce Language Agent Teams for Task Evolution (LATTE), a framework for coordinating LLM teams inspired by distributed systems, where processors must operate under partial observability and communication constraints.
The ability to reliably distinguish human-written text from that generated by large language models is of profound societal importance.
However, we demonstrate that the token-level signal distinguishing human and machine text is non-uniform across the hidden space of the detector model, and naively averaging likelihood-based token scores across regions with fundamentally…
Autoraters, also referred to as LLM-as-judges, are increasingly used for evaluation and automated content moderation.
While these rubrics can be edited to improve the individual accuracy of both human and automated scoring, this approach may result in disagreement between the two scores, or with the associated holistic judgment.
In this paper, we introduce a new multi-genre benchmark (more than 1000 samples) for semantic segmentation in conversational Arabic, focusing on dialectal discourse.
Using this benchmark, we show that segmentation models performing well on MSA news genres degrade on dialectal transcribed speech.
Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of…
Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with…
Extensive experiments on seven benchmark datasets show that LatentRAG achieves performance comparable to explicit agentic RAG methods while reducing inference latency by approximately 90%, substantially narrowing the latency gap with…
To overcome these limitations, we present MANTRA, a framework for automatically synthesizing machine-checkable compliance benchmarks from natural-language manuals and tool schemas.
Empirically, we show that the resulting compliance checks are richer and enforce constraints more strongly than those in existing benchmarks.
Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool calls within multi-turn interactions.
In this paper, we propose A^2TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but re-designs how it is normalized, accumulated, and consumed: (i) turn-group…
Across multiple steering benchmarks, we show that SKOP achieves the best joint steering-utility trade-off, reducing utility degradation by 5-7x while retaining over 95% of vanilla steering efficacy.
Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation.
We define evaluation-context divergence as an observable within-item change in behavior induced by framing a fixed task as an evaluation, a live deployment interaction, or a neutral request, and present a paired-prompt protocol that…
Human label variation has been established as a central phenomenon in NLP: the perspectives different annotators have on the same item need to be embraced.
We present the first large-scale analysis of four reference datasets for harmful language detection, bringing together annotator characteristics, linguistic properties of the items, and their interactions in a statistically informed…
Laughter is a social vocalization that is universal across cultures and languages and is crucial for human communication, serving functions such as social bonding and signaling.