
Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 10

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to explore deeper content clusters.


Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.


Start Here By Objective

Pick your immediate research objective and jump straight to high-signal pages instead of generic search results.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts for your annotation pipeline.

Match reason: Matches selected tags (Human Eval, Llm As Judge).

Score: 58% · High protocol signal · Freshness: Warm · Status: Ready
Demonstrations · Human Eval · Llm As Judge · Long Horizon · General
  • LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
  • We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation.
Open paper
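The entry above adapts the Hindsight Experience Replay (HER) principle to natural-language agent trajectories. The sketch below is not the paper's method; it only illustrates the underlying HER idea it builds on: relabeling a failed episode with the goal the agent actually achieved, so the trajectory can still be reused as a positive example. All names (Trajectory, relabel_with_hindsight, the example goals) are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    observation: str   # page or tool state the agent saw
    action: str        # natural-language or tool-call action it took

@dataclass
class Trajectory:
    goal: str          # instruction the agent was asked to complete
    steps: List[Step]
    success: bool      # did it satisfy the original goal?

def relabel_with_hindsight(traj: Trajectory, achieved_goal: str) -> Trajectory:
    """Classic HER trick: treat whatever the agent actually accomplished as if it
    had been the instruction all along, turning a failed episode into a successful
    demonstration for that substitute goal."""
    return Trajectory(goal=achieved_goal, steps=traj.steps, success=True)

# Hypothetical usage: a WebArena-style navigation attempt that missed its target
# but still ended in a describable state.
failed = Trajectory(
    goal="Find the cheapest USB-C hub and add it to the cart",
    steps=[Step("search page", "type 'usb-c hub' and press enter"),
           Step("results page", "click the first result")],
    success=False,
)
# The achieved goal would normally come from a hindsight labeler or summarizer.
augmented = relabel_with_hindsight(failed, "Open the product page of a USB-C hub")
print(augmented.goal, augmented.success)
```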
HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong, Subhabrata Mukherjee · Jan 9, 2026

Citations: 0

Match reason: Matches selected tags (Human Eval, Llm As Judge).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Pairwise Preference · Rubric Rating · Human Eval · Llm As Judge · General
  • Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
  • We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations.
Open paper

Match reason: Matches selected tags (Human Eval, Llm As Judge).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Human Eval · Llm As Judge · Coding
  • Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments.
  • The automated judgments were verified through human evaluation, demonstrating high agreement (kappa = 0.87).
Open paper
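The agreement figure quoted above (kappa = 0.87) is a chance-corrected agreement statistic between the automatic judge and human raters. As a generic illustration only (not the paper's evaluation code), Cohen's kappa on paired verdicts can be computed with scikit-learn, assuming that dependency is available; the verdicts below are invented:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired verdicts on the same ten items.
llm_judge = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
human     = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail"]

# Cohen's kappa corrects raw percent agreement for the agreement expected by chance;
# values in the 0.8-1.0 range are conventionally read as very strong agreement.
print(cohen_kappa_score(llm_judge, human))
```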
Citations: 0

Match reason: Matches selected tags (Human Eval, Llm As Judge).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Human Eval · Llm As Judge · General
  • Nine instruction-tuned low-parameterized SLMs are evaluated against three commercial LLMs using lexical and semantic similarity metrics alongside qualitative assessments, including human evaluation and LLM-as-a-judge methods.
Open paper
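For context on the two metric families named above, here is a minimal sketch pairing a lexical-overlap score with an embedding-based semantic score for one reference/candidate pair. It is not taken from the paper and assumes the rouge-score and sentence-transformers packages; the model choice and example sentences are arbitrary:

```python
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "The report was submitted before the deadline."
candidate = "The document was handed in ahead of the due date."

# Lexical similarity: longest-common-subsequence overlap (ROUGE-L F1).
lexical = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

# Semantic similarity: cosine similarity of sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice
emb_ref, emb_cand = model.encode([reference, candidate], convert_to_tensor=True)
semantic = util.cos_sim(emb_ref, emb_cand).item()

# Paraphrases typically score low on lexical overlap but high on semantic similarity.
print(f"ROUGE-L: {lexical:.2f}  cosine: {semantic:.2f}")
```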
PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut · Oct 21, 2025

Citations: 0

Match reason: Matches selected tags (Human Eval, Llm As Judge).

Score: 53% · High protocol signal · Freshness: Cold · Status: Ready
Rubric Rating · Human Eval · Llm As Judge · General
  • In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g.
  • We show that PoSh achieves stronger correlations (+0.05 Spearman ρ) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable…
Open paper
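The "+0.05 Spearman ρ" claim above refers to rank correlation between metric scores and human judgments. A generic sketch of how such a correlation is measured (not PoSh's actual evaluation code), assuming SciPy and using invented scores:

```python
from scipy.stats import spearmanr

# Hypothetical per-description scores for the same eight items.
metric_scores = [0.62, 0.81, 0.40, 0.95, 0.55, 0.73, 0.30, 0.88]
human_scores  = [3.0,  4.5,  2.0,  5.0,  3.5,  4.0,  1.5,  4.5]

# Spearman's rho compares rankings, so it is insensitive to the two scores
# living on different scales (a 0-1 metric vs. 1-5 human ratings).
rho, p_value = spearmanr(metric_scores, human_scores)
print(f"rho={rho:.2f}, p={p_value:.3f}")
```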
Citations: 0

Match reason: Matches selected tags (Human Eval, Llm As Judge).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Human Eval · Llm As Judge · Medicine
  • We present AgenticSum, an inference-time, agentic framework that separates context selection, generation, verification, and targeted correction to reduce hallucinated content.
  • We evaluate AgenticSum on two public datasets, using reference-based metrics, LLM-as-a-judge assessment, and human evaluation.
Open paper
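The abstract above describes a pipeline that separates context selection, generation, verification, and targeted correction. As a rough structural sketch only — the stage names and the llm_call placeholder are hypothetical, not AgenticSum's API — one plausible way to wire such stages:

```python
def llm_call(prompt: str) -> str:
    """Placeholder for whatever model client is used; hypothetical."""
    raise NotImplementedError

def select_context(document: str, query: str) -> str:
    # Stage 1: keep only the passages relevant to the summary request.
    return llm_call(f"Select the passages of this document relevant to '{query}':\n{document}")

def generate_summary(context: str) -> str:
    # Stage 2: draft a summary grounded in the selected context only.
    return llm_call(f"Summarize, using only this context:\n{context}")

def verify(summary: str, context: str) -> list[str]:
    # Stage 3: list claims in the draft that the context does not support.
    flagged = llm_call(f"List unsupported claims in:\n{summary}\nGiven context:\n{context}")
    return [line for line in flagged.splitlines() if line.strip()]

def correct(summary: str, context: str, unsupported: list[str]) -> str:
    # Stage 4: rewrite only the flagged spans instead of regenerating everything.
    if not unsupported:
        return summary
    return llm_call(f"Rewrite the summary, fixing these unsupported claims: {unsupported}\n"
                    f"Summary:\n{summary}\nContext:\n{context}")

def summarize(document: str, query: str) -> str:
    context = select_context(document, query)
    draft = generate_summary(context)
    return correct(draft, context, verify(draft, context))
```

Separating the stages this way lets the verification and correction steps run as many times as needed at inference time without retraining the generator.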
LexInstructEval: Lexical Instruction Following Evaluation for Large Language Models

Huimin Ren, Yan Liang, Baiqiao Su, Chaobo Sun, Hengtong Lu, Kaike Zhang · Nov 13, 2025

Citations: 0

Match reason: Matches selected tags (Human Eval, Llm As Judge).

Score: 50% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Human Eval · Llm As Judge · General
  • Current methods either rely on subjective and costly human evaluation or on automated LLM-as-a-judge systems, which suffer from inherent biases and unreliability.
  • To address these limitations, we introduce LexInstructEval, a new benchmark and evaluation framework for fine-grained lexical instruction following.
Open paper
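Fine-grained lexical instruction following is typically scored with deterministic, rule-based checks rather than a judge model. The benchmark's own rules are not reproduced here; the sketch below only illustrates the general idea of programmatically verifiable lexical constraints, with an invented constraint set:

```python
import re

def check_lexical_constraints(response: str) -> dict[str, bool]:
    """Illustrative verifiable checks; the real benchmark's rules differ."""
    words = re.findall(r"[\w']+", response.lower())
    return {
        "contains_keyword_evaluation": "evaluation" in words,
        "avoids_word_obviously": "obviously" not in words,
        "at_most_50_words": len(words) <= 50,
        "ends_with_question_mark": response.strip().endswith("?"),
    }

results = check_lexical_constraints(
    "Which evaluation protocol should we adopt for the next annotation round?"
)
print(results, "pass" if all(results.values()) else "fail")
```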
VISTA: Verification In Sequential Turn-based Assessment

Ashley Lewis, Andrew Perrault, Eric Fosler-Lussier, Michael White · Oct 30, 2025

Citations: 0

Match reason: Matches selected tags (Human Eval, Llm As Judge).

Score: 50% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Human Eval · Llm As Judge · General
  • Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines.
  • Human evaluation confirms that VISTA's decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks.
Open paper
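The "decomposition" mentioned above refers to breaking a response into atomic claims and verifying each against the available evidence, in the spirit of FACTSCORE-style scoring. The sketch below shows only that generic scoring pattern; the string heuristics standing in for decomposition and verification are purely illustrative and are not VISTA's method:

```python
def decompose(response: str) -> list[str]:
    # Naive stand-in: split on sentence boundaries; real systems use an LLM.
    return [s.strip() for s in response.split(".") if s.strip()]

def supported(claim: str, evidence: str) -> bool:
    # Naive stand-in: keyword overlap; real systems use entailment or an LLM judge.
    claim_terms = set(claim.lower().split())
    return len(claim_terms & set(evidence.lower().split())) >= max(1, len(claim_terms) // 2)

def factual_precision(response: str, evidence: str) -> float:
    claims = decompose(response)
    if not claims:
        return 0.0
    return sum(supported(c, evidence) for c in claims) / len(claims)

evidence = "The museum opens at 9am and closes at 5pm on weekdays."
response = "The museum opens at 9am. It offers free parking."
print(factual_precision(response, evidence))  # one of two claims supported -> 0.5
```

Scoring each claim separately is what makes per-item annotator decisions smaller and easier to agree on, which is the agreement benefit the abstract reports.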
Estonian Native Large Language Model Benchmark

Helena Grete Lillepalu, Tanel Alumäe · Oct 24, 2025

Citations: 0

Match reason: Matches selected tags (Human Eval, Llm As Judge).

Score: 46% · Sparse protocol signal · Freshness: Cold · Status: Fallback
Human Eval · Llm As Judge · Multilingual
  • The availability of LLM benchmarks for the Estonian language is limited, and a comprehensive evaluation comparing the performance of different LLMs on Estonian tasks has yet to be conducted.
  • We introduce a new benchmark for evaluating LLMs in Estonian, based on seven diverse datasets.
Open paper
MATA: Mindful Assessment of the Telugu Abilities of Large Language Models

Chalamalasetti Kranti, Sowmya Vajjala · Aug 19, 2025

Citations: 0

Match reason: Matches selected tags (Human Eval, Llm As Judge).

Score: 46% · Sparse protocol signal · Freshness: Cold · Status: Fallback
Human Eval · Llm As Judge · General
  • In this paper, we introduce MATA, a novel evaluation dataset to assess the ability of Large Language Models (LLMs) in Telugu language, comprising 729 carefully curated multiple-choice and open-ended questions that span diverse linguistic…
  • Finally, we also compare LLM-as-a-judge evaluation with human evaluation for open-ended questions to assess its reliability in a low-resource language.
Open paper

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service · Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Managed Service · For Large Projects · Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

For Freelancers · Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.