Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 84 Search mode: keyword Ranking: eval-signal prioritized Shortlist (0) RSS

Filter by tag

All Automatic Metrics (2,367) General (722) Long Horizon (460) Pairwise Preference (394) Coding (307) Simulation Env (263) Multi Agent (241) Medicine (151) Llm As Judge (146) Expert Verification (122) Math (117) Rubric Rating (115) Human Eval (113) Tool Use (101) Web Browsing (101) Red Team (93)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents
Jun 18, 2026 · Citations: 0

Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies.
StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs
Jun 18, 2026 · Citations: 0

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood.
Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems
Jun 18, 2026 · Citations: 0

We propose H-RePlan, a hierarchical replanning framework for multi-device agents with unified API--CLI--GUI execution.
Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users
Jun 18, 2026 · Citations: 0

To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text.
Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology
Jun 18, 2026 · Citations: 0

On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs.
CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges
Jun 18, 2026 · Citations: 0

While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats, zero-shot models frequently generate repetitive and vague responses, underscoring the need for high-quality examples to steer…
Token-Operations-Oriented Inference Optimization Techniques for Large Models
Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback
Jun 18, 2026 · Citations: 0

PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric…
The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse
Jun 18, 2026 · Citations: 0

We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent.
Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families
Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia
Jun 18, 2026 · Citations: 0

The dataset is designed to support the evaluation of machine translation systems that aim to preserve document formatting during translation.
Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact
Jun 18, 2026 · Citations: 0

Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

PACE: A Proxy for Agentic Capability Evaluation

Yueqi Song, Lintang Sutawika, Jiarui Liu, Lindia Tjuatja, Jiayi Geng, Yunze Xiao · Jul 2, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 89% High protocol signal Freshness: Hot Status: Ready

Pairwise Preference Automatic Metrics Coding

We introduce PACE, a framework that constructs proxy benchmarks by selecting instances from existing non-agentic evaluations whose aggregate scores most reliably predict model performances on agentic benchmarks.
We apply PACE to the 4 target agentic benchmarks in this paper, which yields PACE-Bench, the concrete proxy benchmark that we evaluate in the paper.

Open paper

An LLM-Based Framework for Intent-Driven Network Topology Design

Kholoud El-Habbouli, Fen Zhou, Stephane Huet · Jul 1, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Overall, this work provides a systematic benchmark for understanding how LLMs handle structural and resilience constraints in topology synthesis, and supports informed model selection for AI-driven network design.

Open paper

PARTREP: Learning What to Repeat for Decoder-only LLMs

Andikawati P Widjaja, Yongjun Kim, Hyounghun Kim, Jaeho Lee · Jul 2, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 84% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics Math

Across eight benchmarks (including MMLU, GSM8K, and RULER) and three model families (Qwen2.5, Llama3.2, Gemma4), PartRep retains most of the gains of full repetition while using only 59.4\% of its KV cache and 79.0\% of its prefill FLOPs.

Open paper

Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets

Divake Kumar, Sina Tayebati, Devashri Naik, Amanda Sofie Rios, Nilesh Ahuja, Omesh Tickoo · Jun 24, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 84% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions.
We present Argus, a cross-regime benchmark for post-hoc UQ in single-step executable GUI grounding: a 27-method open-weight matrix over 4 VLM agents and 4 datasets, plus an 8-method closed-source matrix across 3 frontier vendors where…

Open paper

SWAN: Semantic Watermarking with Abstract Meaning Representation

Ziping Ye, Gourab Dey, Christos Christodoulopoulos, Charith Peris, Anil Ramakrishna, Weitong Ruan · May 5, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 88% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise Preference Automatic Metrics General

In contrast to existing watermarking methods, which typically encode signatures by adjusting token selection preferences during text generation, SWAN embeds the signature directly in the sentence's semantic representation.
Empirical evaluation on the RealNews benchmark shows SWAN matches state-of-the-art detection performance on unaltered watermarked text, while significantly improving robustness against paraphrasing, increasing detection AUC by up to 13.9…

Open paper

How Far Can You Get Without a GPU? A Systematic Benchmark of Lightweight Hallucination Detection Across Question Answering, Dialogue, and Summarisation

Kriti Faujdar, Smit Kadvani · Jun 29, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 89% High protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Tool Use General

We systematically benchmark five such methods: ROUGE-L, semantic similarity, BERTScore, a Natural Language Inference (NLI) detector based on a FEVER-trained DeBERTa model, and a score-level ensemble of similarity and NLI.
We evaluate them across all three tasks of the HaluEval benchmark: question answering (QA), dialogue, and summarisation.

Open paper

Labeling Training Data for Entity Matching Using Large Language Models

Aaron Steiner, Christian Bizer · Jun 27, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 79% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

We evaluate the workflows using the Abt-Buy, Walmart-Amazon, WDC Products, DBLP-ACM, and DBLP-Scholar benchmarks, and compare the performance of student models trained with machine-labeled data to the performance of the same models trained…
Our experiments show that student models trained using the machine-labeled sets perform approximately on par with models trained on the benchmark training sets, with the remaining differences in both directions staying below two F1 points.

Open paper

When More Sampling Hurts: The Modal Ceiling and Correlation Ceiling of Test-Time Scaling

Yong Yi Bay, Kathleen A. Yearick · Jun 27, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 79% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

For picking an answer, the vote has already settled within a few dozen draws, the modal ceiling; for scoring a benchmark, sooner still, the correlation ceiling.

Open paper

SEAD: Competence-Aware On-Policy Distillation via Entropy-Guided Supervision

Chia-Hsuan Lee, Zelei Cheng, Yu Wang, Renkun Ni, Sambit Sahu, Shi-Xiong Zhang · Jun 26, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 79% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics Math

On OLMo-3 (7B to 32B), SEAD achieves +4.8 avg accuracy over vanilla OPD across six math benchmarks, with ablations confirming super-additive interactions.

Open paper

The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms

Jinghan Zhang, Zerui Cheng, Shiqi Chen, Ge Zhang, Wenhao Huang, Jiashuo Liu · Jun 24, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 79% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics Coding

Traditional evaluations measure a learning algorithm's final performance on an i.i.d.
We introduce the Generalization Spectrum, an evaluation framework designed to expose this hidden dimension.

Open paper

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

Ali Şenol, Garima Agrawal, Huan Liu · May 23, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

Despite remarkable progress on reasoning benchmarks, current LLM evaluation practice remains anchored to final-answer correctness, providing limited insight into how models reason, how reliably they behave under contextual variation, or how…
Experiments across multiple LLMs and benchmarks reveal behaviors systematically concealed by single-metric evaluation, including the orthogonality of local logical coherence and correctness, deployment-context-dependent ranking inversions,…

Open paper

BeLink: Biomedical Entity Linking Meets Generative Re-Ranking

Darya Shlyk, Stefano Montanelli, Lawrence Hunter · May 21, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Medicine

Our method demonstrates strong performance on multiple BEL benchmarks, yielding significant improvements in linking accuracy (3%-24%) while reducing inference time compared to the state-of-the-art.

Open paper

Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

Ömer Faruk Akgül, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna · May 7, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Math

Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of…

Open paper

Diagnosing and Mitigating Context Rot in Long-horizon Search

Shijie Xia, Yikun Wang, Zhen Huang, Pengfei Liu · Jun 29, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 84% Moderate protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Long Horizon General

By evaluating four flagship open-source models across three benchmarks, we reveal a prevalent but unnoticed rot phenomenon: extensive context causes models to directly give up or prematurely provide uncertain answers, and this issue is…

Open paper

Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

Saksham Sahai Srivastava · May 17, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 88% Moderate protocol signal Freshness: Warm Status: Fallback

Automatic Metrics Long Horizon Coding

Long-horizon LLM agents rely on persistent memory to support interactions across sessions, yet existing memory systems often retrieve context using semantic similarity or broad history inclusion, treating retrieved memories as uniformly…
To evaluate this setting, we introduce Causal-LoCoMo, a causally annotated benchmark derived from long conversational data, where each example contains a user request, a structured memory bank, useful memories, irrelevant distractors, and…

Open paper

RASC+: Retrieval-Constrained LLM Adjudication for Clinical Value Set Authoring

Sumit Mukherjee · Jun 22, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 77% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics MedicineCoding

The recently introduced Retrieval-Augmented Set Completion (RASC) benchmark showed that direct zero-shot large language model (LLM) generation is poorly suited to this task: clinical code systems are large, version-controlled, and not…
These results show that retrieval-constrained LLM adjudication can substantially improve value set completion while preserving the safety constraint that all returned codes must come from an auditable candidate pool.

Open paper

Latent Performance Profiling of Large Language Models

Tanmoy Chakraborty, Ayan Sengupta, Suparna Bhattacharya, Partha Pratim Chakrabarti, Amlan Chakrabarti, Supratik Chakraborty · May 28, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 77% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities.
With extensive empirical analyses across eight LLMs, spanning a size range of 0.5B-14B, we demonstrate that models with similar benchmark scores can exhibit contrasting latent profiles, such as differences in entropy or adaptability.

Open paper

Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study

Volodymyr Ovcharov · May 14, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 77% Moderate protocol signal Freshness: Warm Status: Ready

Demonstrations Automatic Metrics Law

We benchmark seven models from five providers on 273 validated court decisions from Ukraine's state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks.
To support reproducibility and address the absence of Ukrainian from legal NLP benchmarks, we release a public dataset of 14,452 court decisions spanning 2008-2026, annotated with seven outcome labels across three temporal epochs that…

Open paper

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai · Jun 15, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 72% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline.
Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth…

Open paper

StoryAlign: Evaluating and Training Reward Models for Story Generation

Haotian Xia, Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou · May 6, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 66% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise Preference Automatic Metrics Coding

Although large language models (LLMs) have significantly advanced text generation, stories generated by LLMs still diverge from human-authored works regarding complex narrative structure and human-aligned preferences.
We find existing reward models struggle to select human-preferred stories, with the best model achieving only 66.3\% accuracy.

Open paper

Protocol Hubs

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives