Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 501

Filter by tag

All · Automatic Metrics (2,174) · General (669) · Long Horizon (424) · Pairwise Preference (365) · Coding (287) · Simulation Env (248) · Multi Agent (228) · Medicine (143) · LLM As Judge (134) · Expert Verification (117) · Human Eval (107) · Math (107) · Rubric Rating (102) · Web Browsing (98) · Tool Use (94) · Red Team (85)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents
Jun 18, 2026 · Citations: 0

Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies.

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs
Jun 18, 2026 · Citations: 0

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood.

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems
Jun 18, 2026 · Citations: 0

We propose H-RePlan, a hierarchical replanning framework for multi-device agents with unified API–CLI–GUI execution.

Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users
Jun 18, 2026 · Citations: 0

To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text.

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology
Jun 18, 2026 · Citations: 0

On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs.

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges
Jun 18, 2026 · Citations: 0

While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats, zero-shot models frequently generate repetitive and vague responses, underscoring the need for high-quality examples to steer…

Token-Operations-Oriented Inference Optimization Techniques for Large Models
Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback
Jun 18, 2026 · Citations: 0

PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric…

The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse
Jun 18, 2026 · Citations: 0

We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent.

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families
Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia
Jun 18, 2026 · Citations: 0

The dataset is designed to support the evaluation of machine translation systems that aim to preserve document formatting during translation.

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact
Jun 18, 2026 · Citations: 0

Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

No exact ID match for "2510.13117" yet. Showing current high-signal papers so you can continue browsing while this paper is indexed.

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

Han Jeon, Shiv Medler, Joseph Voyles, Matt Wood · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% · High protocol signal · Freshness: Hot · Status: Ready

Red Team · LLM As Judge · Automatic Metrics · General

Safety evaluation of LLM outputs has generally relied on LLM-based judges, which can be effective but are often slow and expensive to deploy at scale.
In this paper, we evaluate whether fine-tuned modern encoder classifiers from the ModernBERT family, including ModernBERT and Ettin, can reliably identify harmful LLM outputs in user-model conversations without substantial performance loss…

Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets

Divake Kumar, Sina Tayebati, Devashri Naik, Amanda Sofie Rios, Nilesh Ahuja, Omesh Tickoo · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · General

Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions.
We present Argus, a cross-regime benchmark for post-hoc UQ in single-step executable GUI grounding: a 27-method open-weight matrix over 4 VLM agents and 4 datasets, plus an 8-method closed-source matrix across 3 frontier vendors where…

How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations

Yuxing Cheng, Yuan Wu, Yi Chang · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · Math

To systematically study this problem, we introduce OCR-Robust, a benchmark designed for evaluating OCR reasoning robustness under visual perturbations.
We evaluate robustness using clean accuracy, Relative Corruption Retention (RCR), Worst-Case Retention (WCR), and a composite Corruption Robustness Index (CRI), and benchmark 18 models spanning proprietary systems, open-source VLMs, and…

AI translation of literary texts is "fine", but readers still prefer human translations

Yves Ferstler, Adam Podoxin, Ty Brassington, Roman Grundkiewicz, Maite Taboada, Marzena Karpinska · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% · Moderate protocol signal · Freshness: Hot · Status: Ready

Pairwise Preference · Human Eval · LLM As Judge · Multilingual

While the content may be rendered adequately, we do not know enough about how readers experience it in terms of immersiveness and literary effect, aspects poorly captured by automatic machine translation metrics or human evaluation…
We ask 15 avid readers to compare recently published human translations (HT) to machine translations (MT) generated with an agentic large language model (LLM)-based pipeline, for 15 recent novels in French, Polish, and Japanese and…

Detect, Unlearn, Restore: Defending Text Summarization Models Against Data Poisoning

Poojitha Thota, Shirin Nilizadeh · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · General

In this setting, adversaries manipulate fine-tuning data to induce persistent summarization failures, such as biased or harmful summaries, while preserving standard evaluation metrics.
Across nine architectures and six benchmark datasets under adaptive attacks, our defenses achieve 85-92% detection precision, while gradient-ascent unlearning restores up to 96% of original behavior with minimal utility loss (less than 0.6%…

Dziri Voicebot: An End-to-End Low-Resource Speech-to-Speech Conversational System for Algerian Dialect

Dihia Lanasri, Fairouz Taki, Asma Kemmoum · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Overview of HIPE-2026: Person-Place Relation Extraction from Multilingual Historical Texts

Juri Opitz, Maud Ehrmann, Corina Raclé, Andrianos Michail, Matteo Romanello, Simon Clematide · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% · Moderate protocol signal · Freshness: Hot · Status: Ready

Automatic Metrics · Multilingual

Answering such questions from noisy, multilingual historical documents is the central challenge of HIPE-2026, the third edition of the HIPE evaluation series.
A distinctive feature of HIPE-2026 is its three-fold evaluation framework, which assesses predictive accuracy, computational efficiency, and cross-domain generalization, reflecting the practical demands of large-scale historical document…

Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability

Yang Tian, Zhengpeng Shi, Bo Zhao · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% · High protocol signal · Freshness: Hot · Status: Fallback

Automatic Metrics · Tool Use · Medicine · Coding

We introduce ToolBench-X, a benchmark for evaluating agents under recoverable reliability hazards.
These results suggest that tool-use evaluation should move beyond function-call accuracy toward task completion under unreliable tool environments.

Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% · Sparse protocol signal · Freshness: Hot · Status: Ready

Tool Use · Coding

Tool use enables large language models (LLMs) to perform complex tasks, and recent agentic reinforcement learning (RL) methods show promise for enhancing model capabilities.
We find that interleaving supervised fine-tuning (SFT) with RL substantially improves stability, but exhibits degraded performance under format and content out-of-distribution (OOD) evaluation.

SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment

Tianyu Dong, Yangyang Liu, Jiang Zhou, Xinwei Wu, Xiaohu Zhao, Hao Wang · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% · Sparse protocol signal · Freshness: Hot · Status: Ready

Multilingual

We conduct experiments on 2 LLMs across 5 low-resource languages and 3 benchmarks.

Real-Time Voice AI Hears but Does Not Listen

Martijn Bartelds, Federico Bianchi, James Zou · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

Akshay Paruchuri, Sanmi Koyejo, Ehsan Adeli · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI evaluation…

When Certainty Is an Artifact: Keyword Lexicon Blindness and the (Mis)Measurement of Rhetorical Stance

Bo Chen · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining

Juliana Li, Diya Sreedhar · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar

Ilseyar Alimova, Bogdan Monogov, Artyom Mazur, Daniil Antonov, Vsevolod Karimov, Vitaliy Egorov · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

Multilingual

Text detoxification, the automated detection and mitigation of abusive and harmful content, is essential for ensuring the safety of online communities and protecting users.
We also introduce a new dataset for text detoxification in Tatar, designed for fine tuning and evaluation in low resource settings.

Autodata: An agentic data scientist to create high quality synthetic data

Ilia Kulikov, Chenxi Whitehouse, Tianhao Wu, Yixin Nie, Swarnadeep Saha, Eryk Helenowski · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

Math · Law

We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data.
We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data.

SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models

Liang-Yuan Wu, Zih-Ching Chen, Tongshuang Wu, Chao-Han Huck Yang, Hua Shen · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

General

As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues has become a critical bottleneck for natural human-AI communication.
However, existing evaluations of machine emotional intelligence assess reasoning exclusively through isolated text or passive acoustic perception, overlooking the complex cross-modal reasoning required for active, multi-turn dialogue.

Weave of Formal Thought

Alexandre Bouayad · Jun 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% · Sparse protocol signal · Freshness: Hot · Status: Ready

Coding