Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 77

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.


SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Sher Badshah, Ali Emami, Hassan Sajjad · Feb 13, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference · Automatic Metrics · General
  • Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.
  • To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to…
Open paper
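The bidirectional querying step described above can be sketched as follows. This is a minimal illustration of the idea, not the paper's implementation: the `judge` callable, the convention that it returns P(first response preferred), and the equal-weight aggregation are all assumptions for the sketch.

```python
import math

def bidirectional_preference_entropy(judge, resp_a, resp_b):
    """Hedged sketch of a BPE-style uncertainty signal.

    `judge(first, second)` is assumed to return the probability that the
    first-shown response is preferred (a float in [0, 1]). All names here
    are illustrative, not the paper's actual interface.
    """
    # Query the judge under both response orderings to counter position bias.
    p_ab = judge(resp_a, resp_b)        # P(A preferred | A shown first)
    p_ba = 1.0 - judge(resp_b, resp_a)  # P(A preferred | B shown first)

    # Aggregate into an order-invariant preference probability.
    p = 0.5 * (p_ab + p_ba)

    # Binary entropy of the aggregated preference:
    # high entropy means the judge is uncertain between A and B.
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

A judge that flips its verdict depending on which response is shown first yields an aggregated `p` near 0.5 and therefore entropy near 1, flagging the comparison as unreliable.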

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Ready
Rubric Rating · Automatic Metrics · General
  • Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for trained readers, who can over-trust surface well-formedness.
  • We present LREAD, a Korean-specific instantiation of a rubric-based expert-calibration framework for human attribution of LLM-generated text.
Open paper
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics · Multi Agent · Math · Coding
  • Existing Multi-Agent Systems (MAS) typically rely on homogeneous model configurations, failing to exploit the diverse expertise inherent in different post-trained architectures.
  • Team-of-Thoughts introduces two novel components: (1) Orchestrator Calibration, which identifies models with superior coordination and synthesis capabilities, and (2) Agent Self-Assessment, a protocol where tool agents profile their own…
Open paper
PMG: Parameterized Motion Generator for Human-like Locomotion Control

Chenxi Han, Yuheng Min, Zihao Huang, Ao Hong, Hang Liu, Yi Cheng · Feb 13, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics · Long Horizon · General
  • Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain.
  • To address these limitations, we propose the Parameterized Motion Generator (PMG), a real-time motion generator grounded in an analysis of human motion structure that synthesizes reference trajectories using only a compact set of parameteri…
Open paper
IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning

Aayush Mishra, Daniel Khashabi, Anqi Liu · Sep 26, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 78% High protocol signal Freshness: Cold Status: Ready
Demonstrations · Automatic Metrics · General
  • Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and two model families.
Open paper
Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Fallback
Rubric Rating · Critique Edit · Coding
  • Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
  • We address this measurement gap with a tri-tier evaluation framework grounded in art-theoretical constructs (Section 2).
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 78% Moderate protocol signal Freshness: Cold Status: Fallback
LLM As Judge · Automatic Metrics · General
  • A qualitative audit by an independent LLM-as-a-judge confirms the discovery of meaningful functional axes, such as policy intent, that thematic ground-truth labels fail to capture.
Open paper
