Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 135 · Search mode: keyword

Filter by tag

Automatic Metrics (2,367) · General (722) · Long Horizon (460) · Pairwise Preference (394) · Coding (307) · Simulation Env (263) · Multi Agent (241) · Medicine (151) · LLM-as-Judge (146) · Expert Verification (122) · Math (117) · Rubric Rating (115) · Human Eval (113) · Tool Use (101) · Web Browsing (101) · Red Team (93)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents
Jun 18, 2026 · Citations: 0

Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies.

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs
Jun 18, 2026 · Citations: 0

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood.

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems
Jun 18, 2026 · Citations: 0

We propose H-RePlan, a hierarchical replanning framework for multi-device agents with unified API–CLI–GUI execution.

Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users
Jun 18, 2026 · Citations: 0

To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text.

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology
Jun 18, 2026 · Citations: 0

On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs.

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges
Jun 18, 2026 · Citations: 0

While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats, zero-shot models frequently generate repetitive and vague responses, underscoring the need for high-quality examples to steer…

Token-Operations-Oriented Inference Optimization Techniques for Large Models
Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback
Jun 18, 2026 · Citations: 0

PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric…

The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse
Jun 18, 2026 · Citations: 0

We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent.

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families
Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia
Jun 18, 2026 · Citations: 0

The dataset is designed to support the evaluation of machine translation systems that aim to preserve document formatting during translation.

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact
Jun 18, 2026 · Citations: 0

Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages instead of running a generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

PACE: A Proxy for Agentic Capability Evaluation

Yueqi Song, Lintang Sutawika, Jiarui Liu, Lindia Tjuatja, Jiayi Geng, Yunze Xiao · Jul 2, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% · High protocol signal · Freshness: Warm · Status: Ready

Pairwise Preference · Automatic Metrics · Coding

We introduce PACE, a framework that constructs proxy benchmarks by selecting instances from existing non-agentic evaluations whose aggregate scores most reliably predict model performances on agentic benchmarks.
We apply PACE to the 4 target agentic benchmarks in this paper, which yields PACE-Bench, the concrete proxy benchmark that we evaluate in the paper.

AIriskEval-edu: New Dataset for Risk Assessment in AI-mediated K-12 Educational Explanations

Javier Irigoyen, Roberto Daza, Francisco Jurado, Julian Fierrez, Ruben Tolosana, Alvaro Ortigosa · Jul 2, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% · High protocol signal · Freshness: Warm · Status: Ready

Rubric Rating · Automatic Metrics · General

For each question, the dataset includes an explanation written by a human teacher alongside 11 explanations generated by LLM-simulated teacher profiles associated with distinct pedagogical risks.

What Survives Into Context: A Diagnostic for Budget-Constrained Multi-Hop RAG and When Submodular Evidence Packing Improves It

Ananto Nayan Bala · Jul 1, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% · Moderate protocol signal · Freshness: Warm · Status: Ready

Automatic Metrics · General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

SABER-Math: Automated Benchmark for Information Retrieval Evaluation in Mathematics

Nikolay Georgiev, Maria Drencheva, Kseniia Ibragimova, Ivo Petrov, Dimitar I. Dimitrov, Martin Vechev · Jun 29, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% · Moderate protocol signal · Freshness: Warm · Status: Ready

Pairwise Preference · Automatic Metrics · Math

As agentic AI systems tackle more complex mathematical tasks, they increasingly rely on information retrieval (IR) to search problem databases, theorem libraries, and educational resources.
Importantly, we show that general-purpose IR benchmarks such as MTEB do not reliably predict mathematical performance, especially for recent embedding models, highlighting the need for math-specific retrieval benchmarks.

mamabench and mamaretrieval: Benchmarks for Evaluating Medical Retrieval-Augmented Generation in Maternal, Neonatal, and Reproductive Health

Yi Ren · Jun 28, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% · High protocol signal · Freshness: Warm · Status: Ready

Rubric Rating · Automatic Metrics · Medicine

Medical question-answering benchmarks rarely cover the maternal, neonatal, child, and reproductive-health questions a nurse-midwife asks, and, to our knowledge, no public chunk-level relevance benchmark exists for maternal-health guideline…
We release two benchmarks that fill these gaps.

Memory Makes the Difference: Evaluating How Different Memory Roles Shape Conversational Agents

Yuxin Wang, Paul Thomas, Zhiwei Yu, Yuan Gao, Saeed Hassanpour, Soroush Vosoughi · Jun 24, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% · Moderate protocol signal · Freshness: Warm · Status: Ready

Pairwise Preference · Automatic Metrics · General

Specifically, how they shape an agent's responses under varying conversational contexts and whether they lead to substantively different response behaviors.
We present a fine-grained taxonomy of conversational memory, classify retrieved memories into different role types, and design a user-centric evaluation framework that simulates user perspectives.

AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression

Yijing Chen, Wenhui Tan, Xiaoyi Yu, Yuyue Wang, Xin Cheng, Kaisi Guan · Jun 23, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% · Moderate protocol signal · Freshness: Warm · Status: Ready

Automatic Metrics · General

Experiments show that AVOC achieves state-of-the-art performance on long-form audio-video benchmarks, surpassing the second-best model by 4.9 and 5.5 points in average accuracy on OmniVideoBench and LVOmniBench, respectively.

Improving Answer Extraction in Context-based Question Answering Systems Using LLMs

Hafez Abdelghaffar, Ahmed Alansary, Ali Hamdi · Jun 4, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% · Moderate protocol signal · Freshness: Warm · Status: Ready

Automatic Metrics · General

Our methodology involves fine-tuning a pre-trained LLM on a benchmark QA dataset to improve its contextual comprehension and answer extraction capabilities.
Specifically, we utilize the Stanford Question Answering Dataset (SQuAD1.1), which provides high-quality context-question-answer triplets for supervised training and evaluation.

DNA Language Models: An Assessment of Pre-Training for Fine-Tuning Tasks

Romain Karpinsky, Julien Mozziconacci, Mickaël Delcey · Jun 29, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% · Moderate protocol signal · Freshness: Warm · Status: Ready

Automatic Metrics · General

However, systematic benchmark comparisons across these methods remain scarce.

PETRA: Transforming Web Text for Petroleum-Engineering Domain Adaptation

Kirill Dubovikov, Omar El Mansouri, Hachem Madmoun, Yanda Li, Sandeep Kumar, Aya El Mir · Jun 23, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% · Moderate protocol signal · Freshness: Warm · Status: Ready

Automatic Metrics · General

Reranker adaptation improves the public Earth Science benchmark by 44% relative and a six-task reasoning-intensive panel by 23%.

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Darrien McKenzie, Nicklas Hansen, Xiaolong Wang · Jun 18, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% · Moderate protocol signal · Freshness: Warm · Status: Ready

Automatic Metrics · General

Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance).

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai · Jun 15, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% · Moderate protocol signal · Freshness: Warm · Status: Ready

Automatic Metrics · General

Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline.
Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth…

Severity-Aware Curriculum Learning with Multi-Model Response Selection for Medical Text Generation

Ahmed Alansary, Molham Mohamed, Ali Hamdi · Jun 3, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% · Moderate protocol signal · Freshness: Warm · Status: Ready

Automatic Metrics · Medicine

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

MASF: A Multi-Model Adaptive Selection Framework for Abstractive Text summarization

Ahmed Alansary, Ali Hamdi · Jun 3, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% · Moderate protocol signal · Freshness: Warm · Status: Ready

Automatic Metrics · General

The generated summaries are then evaluated using automatic evaluation metrics that capture both lexical similarity and semantic relevance.

Telenor Nordics Customer Service self-help corpus

Mike Riess · May 26, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% · Moderate protocol signal · Freshness: Warm · Status: Ready

Automatic Metrics · Multilingual

The documents have been sourced from the public self-help pages of four Nordic telecommunications operators and subsequently filtered for person-identifiable information and relevance through a combined LLM and human annotation pipeline.
Domain-specific datasets for Nordic languages remain scarce, particularly in customer service: a domain of growing importance for retrieval-augmented generation, cross-lingual transfer learning, and emerging agent-based service…

SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction

Jiawei He, Mengyu Shi, Jiawei Liu, Dong Sun, Chunrong Fang, Xikai Yang · May 22, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% · Moderate protocol signal · Freshness: Warm · Status: Ready

Automatic Metrics · General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

Sovandara Chhoun, Pichdara Po, Sereiwathna Ros, Wan-Sup Cho, Saksonita Khoeurn · May 21, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% · Moderate protocol signal · Freshness: Warm · Status: Ready

Automatic Metrics · Multilingual

For evaluation, we perform 5-fold cross-validation over 18 question-answer pairs.

A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering

Sereiwathna Ros, Phannet Pov, Ratanaktepi Chhor, Kimleang Ly, Wan-Sup Cho, Saksonita Khoeurn · May 21, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% · Moderate protocol signal · Freshness: Warm · Status: Ready

Automatic Metrics · General

We conduct a two-phase comparative evaluation.
First, we benchmark three embedding models: BGE-M3 (567M), Jina-Embeddings-v3 (570M), and Qwen3-Embedding (597M), for dense retrieval over Khmer documents.

Information Dynamics of Language Communication

Leonardo S. Goodall, Andrea I. Luppi, Pedro A. M. Mediano · Jun 29, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 77% · Sparse protocol signal · Freshness: Warm · Status: Ready

Medicine

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Learning User-Aware Recall: Personalized Retrieval in Long-Term Conversational Memory

ZhiShu Jiang, Haibo Liu, Xin Shen, Guanqiang QI, Chenxi Miao, Weikang Li · May 28, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% · Moderate protocol signal · Freshness: Warm · Status: Fallback

Pairwise Preference · General

Long-term conversational agents are expected to remember past interactions, but memory is useful only when the right evidence is recalled for the right user.
Existing memory-augmented LLM agents have made progress in building compact memory banks, yet retrieval is still often driven by query-centered similarity or fixed ranking rules, leaving user-conditioned relevance underexplored.
