A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Total papers: 55 · Search mode: keyword · Ranking: eval-signal prioritized
Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks.
Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
To this end, we introduce \OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios.
We adopt a rubric-based evaluation protocol scoring factual accuracy, logical coherence, practical feasibility, and professional compliance, focused on expert-level problems to ensure meaningful differentiation across agents.
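As a rough illustration of how a rubric protocol along these four axes can be operationalized (the axis names follow the abstract; the 0-5 scale, equal weighting, and the `judge_llm` helper are assumptions, not the paper's implementation):

```python
# Minimal sketch of rubric-based LLM-as-judge scoring.
# Axis names follow the abstract; scale, weighting, and judge_llm are hypothetical.
from statistics import mean

RUBRIC_AXES = [
    "factual accuracy",
    "logical coherence",
    "practical feasibility",
    "professional compliance",
]

def score_response(task: str, response: str, judge_llm) -> dict:
    """Ask a judge model to rate each rubric axis on a 0-5 scale, then average."""
    scores = {}
    for axis in RUBRIC_AXES:
        prompt = (
            f"Task:\n{task}\n\nAgent response:\n{response}\n\n"
            f"Rate the response's {axis} from 0 (unacceptable) to 5 (expert-level). "
            "Answer with a single integer."
        )
        scores[axis] = int(judge_llm(prompt).strip())
    scores["overall"] = mean(scores[a] for a in RUBRIC_AXES)
    return scores
```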
However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming.
In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models' supervision.
We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses.
Across 7 benchmarks and 5 task models, PEEM's accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate Spearman ρ ≈ 0.97, Pearson r ≈ 0.94, p < 0.001).
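For readers unfamiliar with this kind of alignment check, correlations of this form can be computed as follows (a generic sketch, not PEEM's code; the per-model score lists are placeholders, not results from the paper):

```python
# Check how well a metric's accuracy axis tracks conventional accuracy
# across task models; the score lists below are illustrative placeholders.
from scipy.stats import pearsonr, spearmanr

conventional_acc = [0.62, 0.71, 0.55, 0.80, 0.67]   # one value per task model
metric_acc_axis  = [0.60, 0.73, 0.52, 0.81, 0.66]   # metric's accuracy axis for the same models

rho, rho_p = spearmanr(conventional_acc, metric_acc_axis)  # rank agreement (ranking preserved)
r, r_p = pearsonr(conventional_acc, metric_acc_axis)       # linear agreement
print(f"Spearman rho={rho:.2f} (p={rho_p:.3f}), Pearson r={r:.2f} (p={r_p:.3f})")
```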
Existing benchmarks are constrained by static and narrow questions, leading to limited coverage and misleading evaluations.
We present KGHaluBench, a Knowledge Graph-based hallucination benchmark that assesses LLMs across the breadth and depth of their knowledge, providing a fairer and more comprehensive insight into LLM truthfulness.
Drawing inspiration from Hal Gregersen's "Questions Are the Answer" theory, we design a multi-agent debate framework comprising four specialized components: a question outline generator, text segmenter, integrity reviewer, and knowledge…
Additionally, to address the long evaluation chains and low efficiency of existing chunking evaluation methods, which rely heavily on downstream QA tasks, we introduce a novel direct evaluation metric, ChunkScore.
We introduce S-GRADES (Studying Generalization of Student Response Assessments in Diverse Evaluative Settings), a web-based benchmark that consolidates 14 diverse grading datasets under a unified interface with standardized access and…
To demonstrate the utility of S-GRADES, we evaluate three state-of-the-art large language models across the benchmark using multiple reasoning strategies in prompting.
These findings underscore the value of integrating computational modeling with psychological theory to assess dynamic mental states in contextually sensitive and human-understandable ways.
However, their non-sequential, bidirectionally masked generation makes quality assessment difficult, underscoring the need for effective self-evaluation.
In this work, we propose DiSE, a simple yet effective self-evaluation confidence quantification method for dLLMs.
Subtle and indirect hate speech remains an underexplored challenge in online safety research, particularly when harmful intent is embedded within misleading or manipulative narratives.
We benchmark multiple open-source language models on HateMirage using ROUGE-L F1 and Sentence-BERT similarity to assess explanation coherence.
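A minimal sketch of the two explanation-coherence metrics named above, using common open-source implementations (the example strings and the SBERT checkpoint are assumptions, not the paper's setup):

```python
# Compute ROUGE-L F1 and Sentence-BERT cosine similarity between a reference
# explanation and a model-generated one; strings and checkpoint are illustrative.
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "The post implies the targeted group is inherently dishonest."
generated = "The text insinuates that the group cannot be trusted."

# ROUGE-L F1: longest-common-subsequence overlap between the two explanations
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l_f1 = rouge.score(reference, generated)["rougeL"].fmeasure

# Sentence-BERT similarity: semantic closeness of the explanation embeddings
sbert = SentenceTransformer("all-MiniLM-L6-v2")
emb_ref, emb_gen = sbert.encode([reference, generated], convert_to_tensor=True)
sbert_sim = util.cos_sim(emb_ref, emb_gen).item()

print(f"ROUGE-L F1 = {rouge_l_f1:.3f}, SBERT similarity = {sbert_sim:.3f}")
```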
Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions.
Meaning in human language is relational, context-dependent, and emergent, arising from dynamic systems of signs rather than fixed word-concept mappings.
We then introduce the Inductive Conceptual Rating (ICR) metric, a qualitative evaluation approach grounded in inductive content analysis and reflexive thematic analysis, designed to assess semantic accuracy and meaning alignment in…
We introduce \ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.
It turns math problems into a controlled, correctness-checkable benchmark with tool sets, enabling systematic evaluation of model reliability under (1) large, overlapping tool catalogs and (2) the absence of the intended capability.
Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves >70% win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning.
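For reference, a win rate of this kind is typically computed from pairwise judge verdicts as in the sketch below (the verdict list is a placeholder, not data from the paper, and the tie-splitting convention is an assumption):

```python
# Generic win-rate computation from pairwise LLM-as-judge verdicts.
from collections import Counter

# Each verdict records which system the judge preferred on one prompt.
verdicts = ["model", "model", "baseline", "tie", "model", "model", "baseline", "model"]

counts = Counter(verdicts)
# Common convention: split ties evenly between the two systems.
win_rate = (counts["model"] + 0.5 * counts["tie"]) / len(verdicts)
print(f"win rate = {win_rate:.1%}")
```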