Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 83 Search mode: keyword Ranking: eval-signal prioritized Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,620) General (530) Long Horizon (320) Pairwise Preference (288) Coding (218) Simulation Env (187) Multi Agent (182) Medicine (116) Llm As Judge (107) Expert Verification (97) Human Eval (89) Rubric Rating (82) Web Browsing (79) Math (77) Demonstrations (67) Critique Edit (63)

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary · Oct 5, 2025

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 88% High protocol signal Freshness: Cold Status: Ready

Rubric Rating Automatic MetricsSimulation Env Coding

We present a principled Bayesian evaluation framework that replaces Pass@k and average accuracy over N trials (avg@N) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and…
Together, these results recommend replacing Pass@k for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit.

Open paper

APEX-Agents

Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein · Jan 20, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 78% High protocol signal Freshness: Warm Status: Ready

Rubric RatingExpert Verification Automatic Metrics Long Horizon Law

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate…
We test eight agents for the leaderboard using Pass@1.

Open paper

From Intuition to Calibrated Judgment: A Rubric-Based Expert-Panel Study of Human Detection of LLM-Generated Korean Text

Shinwoo Park, Yo-Sub Han · Jan 6, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 78% High protocol signal Freshness: Warm Status: Ready

Rubric Rating Automatic Metrics General

Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for trained readers, who can over-trust surface well-formedness.
We present LREAD, a Korean-specific instantiation of a rubric-based expert-calibration framework for human attribution of LLM-generated text.

Open paper

Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms

Nurul Aisyah, Muhammad Dehan Al Kautsar, Arif Hidayat, Raqib Chowdhury, Fajri Koto · Jun 5, 2025

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 83% Moderate protocol signal Freshness: Cold Status: Ready

Rubric Rating Automatic Metrics Math

Assessment tasks include grading and generating personalized Indonesian feedback guided by rubric-based evaluation.

Open paper

MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam · Sep 30, 2025

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 88% High protocol signal Freshness: Cold Status: Fallback

Pairwise PreferenceRubric Rating Automatic Metrics Multilingual

To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms.
Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain.

Open paper

Distilling Feedback into Memory-as-a-Tool

Víctor Gallego · Jan 9, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 73% Moderate protocol signal Freshness: Warm Status: Ready

Rubric RatingCritique Edit Automatic Metrics General

We propose a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, through a file-based memory system and agent-controlled tool calls.

Open paper

HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong, Subhabrata Mukherjee · Jan 9, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 73% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise PreferenceRubric Rating Human EvalLlm As Judge General

Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations.

Open paper

From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark

Jinning Zhang, Jie Song, Wenhui Tu, Zecheng Li, Jingxuan Li, Jin Li · Jan 1, 2026

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 73% Moderate protocol signal Freshness: Warm Status: Ready

Rubric RatingExpert Verification Automatic Metrics Medicine

Validated in sports rehabilitation, we release a knowledge graph (357,844 nodes, 371,226 edges) and a benchmark of 1,637 QA pairs.
Five expert clinicians rated the system 4.66--4.84 on a 5-point Likert scale, and system rankings are preserved on a human-verified gold subset (n=80).

Open paper

PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut · Oct 21, 2025

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 73% High protocol signal Freshness: Cold Status: Ready

Rubric Rating Human EvalLlm As Judge General

In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g.
We show that PoSh achieves stronger correlations (+0.05 Spearman ρ) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable…

Open paper

Role-Augmented Intent-Driven Generative Search Engine Optimization

Xiaolu Chen, Haojie Wu, Jie Bao, Zhen Chen, Yong Liao, Hu Huang · Aug 15, 2025

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 73% High protocol signal Freshness: Cold Status: Ready

Rubric Rating Automatic Metrics Web Browsing General

To better evaluate the method under realistic settings, we address the benchmarking limitations of prior work by: (1) extending the GEO dataset with diversified query variations reflecting real-world search scenarios and (2) introducing…

Open paper

Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 83% Moderate protocol signal Freshness: Warm Status: Fallback

Rubric RatingCritique Edit Coding

Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
We address this measurement gap with a tri-tier evaluation framework grounded in art-theoretical constructs (Section 2).

Open paper

PrefDisco: Benchmarking Proactive Personalized Reasoning

Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh, Maryam Fazel · Sep 30, 2025

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 83% Moderate protocol signal Freshness: Cold Status: Fallback

Pairwise PreferenceRubric Rating Automatic Metrics General

We introduce PrefDisco, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse, context-dependent preferences, and define PrefAlign as a…
PrefDisco builds scenarios where identical questions require different reasoning chains depending on user context, as optimal explanation approaches vary by individual expertise and preferences while maintaining factual accuracy.

Open paper

DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation

Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song · Dec 19, 2025

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 68% Moderate protocol signal Freshness: Warm Status: Ready

Rubric RatingExpert Verification Long Horizon General

However, evaluating such reports remains challenging: report quality is multifaceted, making it difficult to determine what to assess and by what criteria; LLM-based judges may miss errors that require domain expertise to identify; and…
To address these issues, we propose DEER, a benchmark for evaluating expert-level deep research reports.

Open paper

Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi · Oct 23, 2025

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 68% Moderate protocol signal Freshness: Cold Status: Ready

Rubric Rating Human EvalAutomatic Metrics Coding

Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters.
Our results show that ChatGPT-based coding perform consistently in the same way as human raters across gender or racial/ethnic groups, demonstrating the possibility of its use in large-scale assessments of collaboration and communication.

Open paper

Augmenting Rating-Scale Measures with Text-Derived Items Using the Information-Determined Scoring (IDS) Framework

Joe Watson, Ivan O'Connor, Chia-Wen Chen, Luning Sun, Fang Luo, David Stillwell · Oct 9, 2025

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 68% Moderate protocol signal Freshness: Cold Status: Ready

Rubric Rating Automatic MetricsSimulation Env Medicine

This marks a conceptual departure from traditional automated text scoring by prioritising information gain over fidelity to expert rubrics or human-annotated data.

Open paper

ScholarEval: Research Idea Evaluation Grounded in Literature

Hanane Nour Moussa, Patrick Queiroz Da Silva, Daniel Adu-Ampratwum, Alyson East, Zitong Lu, Nikki Puccetti · Oct 17, 2025

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 78% Moderate protocol signal Freshness: Cold Status: Fallback

Rubric Rating Coding

As AI tools become increasingly common for research ideation, robust evaluation is critical to ensure the validity and usefulness of generated ideas.
We introduce ScholarEval, a retrieval augmented evaluation framework that assesses research ideas based on two fundamental criteria: soundness - the empirical validity of proposed methods based on existing literature, and contribution - the…

Open paper

Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training

Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch · Sep 25, 2025

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 63% Moderate protocol signal Freshness: Cold Status: Ready

Rubric Rating Automatic Metrics General

Open paper

Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

Adrian Arnaiz-Rodriguez, Miguel Baidal, Erik Derner, Jenn Layton Annable, Mark Ball, Mark Ince · Sep 29, 2025

Citations: 0

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: sparse protocol signal.

Score: 73% Sparse protocol signal Freshness: Cold Status: Fallback

Rubric Rating Medicine

Despite their support capabilities, safe detection and response to crises such as suicidal ideation and self-harm are still unclear, hindered by the lack of unified crisis taxonomies and clinical evaluation standards.
We also use LLMs to identify crisis inputs and audit five models for response safety and appropriateness.

Open paper

Toward LLM-Supported Automated Assessment of Critical Thinking Subskills

Marisa C. Peczuh, Nischal Ashok Kumar, Ryan Baker, Blair Lehman, Danielle Eisenberg, Caitlin Mills · Oct 14, 2025

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: sparse protocol signal.

Score: 58% Sparse protocol signal Freshness: Cold Status: Fallback

Rubric Rating Coding

As the world becomes increasingly saturated with AI-generated content, disinformation, and algorithmic persuasion, critical thinking - the capacity to evaluate evidence, detect unreliable claims, and exercise independent judgment - is…
We developed a coding rubric based on an established skills progression and completed human coding for a corpus of student essays.

Open paper

Human Psychometric Questionnaires Mischaracterize LLM Psychology: Evidence from Generation Behavior

Woojung Song, Dongmin Choi, Yoonah Park, Jongwook Han, Eun-Ju Lee, Yohan Jo · Sep 12, 2025

Citations: 0

Match reason: Keyword overlap 2/3 across title and protocol fields. Eval-signal density: sparse protocol signal.

Score: 58% Sparse protocol signal Freshness: Cold Status: Fallback

Rubric Rating General

Psychological profiling of large language models (LLMs) using psychometric questionnaires designed for humans has become widespread.
To examine the risk of human questionnaires mischaracterizing LLM psychology, we compare two types of profiles for eight open-source LLMs: self-reported Likert scores from established questionnaires (PVQ-40, PVQ-21, BFI-44, BFI-10) and…

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent