

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 3 · Search mode: hybrid


No exact ID match for "2603.04964". Showing closest results for "Replaying pre training data improves fine tuning" instead.
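In other words, the explorer attempts an exact arXiv-ID lookup first and only falls back to semantic retrieval when that fails. A minimal sketch of that fallback logic, using hypothetical index and function names rather than this site's actual API:

```python
# Hypothetical sketch of the hybrid lookup behavior described above:
# exact arXiv-ID match first, then semantic retrieval as a fallback.

def hybrid_search(query: str, id_index: dict, semantic_index) -> list:
    """Return papers for `query`, preferring an exact ID match."""
    # Exact ID match: arXiv IDs look like "2603.04964".
    if query in id_index:
        return [id_index[query]]
    # Fallback: rank papers by semantic similarity to the query text.
    # `semantic_index.search` is an assumed interface returning
    # (paper, score) pairs sorted by relevance.
    return [paper for paper, score in semantic_index.search(query, top_k=3)]
```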
Same Words, Different Judgments: Modality Effects on Preference Alignment

Aaron Broukhim, Nadir Weibel, Eshin Jolly · Feb 26, 2026

Citations: 0

Match reason: Broad semantic/index fallback.

Score: 45% · High protocol signal · Freshness: Hot · Status: Ready
Pairwise Preference · RLAIF or Synthetic Feedback · Automatic Metrics · General
  • Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences, but its application to speech remains underexplored.
  • We present a controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts.
Open paper
Replaying pre-training data improves fine-tuning

Suhas Kotha, Percy Liang · Mar 5, 2026

Citations: 0

Match reason: Broad semantic/index fallback.

Score: 45% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Automatic Metrics · Web Browsing · Math
  • We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by 4.5% and Basque question-answering accuracy by 2%.
Open paper
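A minimal sketch of the replay recipe summarized in the entry above: mix a fraction of pre-training examples back into each fine-tuning batch. The 25% replay ratio and the function names here are illustrative assumptions, not the paper's reported configuration.

```python
import random

# Sketch of replay during fine-tuning (assumed formulation, not the
# paper's exact recipe): at each step, fill part of the batch with
# samples from the original pre-training corpus instead of the
# fine-tuning set.

REPLAY_RATIO = 0.25  # hypothetical fraction of pre-training data per batch

def mixed_batch(finetune_data, pretrain_data, batch_size=32):
    """Build one training batch that replays pre-training examples."""
    n_replay = int(batch_size * REPLAY_RATIO)
    batch = random.sample(pretrain_data, n_replay)
    batch += random.sample(finetune_data, batch_size - n_replay)
    random.shuffle(batch)
    return batch
```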
Diverging Preferences: When do Annotators Disagree and do Models Know?

Michael JQ Zhang, Zhilin Wang, Jena D. Hwang, Yi Dong, Olivier Delalleau, Yejin Choi · Oct 18, 2024

Citations: 0

Match reason: Broad semantic/index fallback.

Score: 30% · Moderate protocol signal · Freshness: Cold · Status: Ready
Pairwise Preference · LLM as Judge · General
  • In our experiments, we demonstrate how standard reward modeling (e.g., Bradley-Terry) and LLM-as-Judge evaluation methods fail to account for divergence between annotators.
  • To address these issues, we develop methods for identifying diverging preferences to mitigate their influence in evaluations and during LLM training.
Open paper
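For context on the Bradley-Terry critique in the entry above: standard reward modeling fits a single scalar reward per response and scores a pair through the sigmoid of the reward gap, so a genuinely split annotator pool is indistinguishable from a noisy but agreeing one. A minimal sketch, with hypothetical reward values:

```python
import math

def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """P(a preferred over b) = sigmoid(r_a - r_b) under Bradley-Terry."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# One preference probability per pair: annotators who genuinely diverge
# (e.g., a 50/50 split on a value-laden prompt) are modeled the same way
# as noisy agreement, which is the failure mode the paper studies.
print(bradley_terry_prob(1.2, 0.7))  # ~0.62
```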
