A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
We argue that this temporal taskification step is not a neutral preprocessing choice, but a structural component of evaluation: different valid splits of the same stream can induce different CL regimes and therefore different benchmark…
Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task.
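As a rough illustration of what an embedding-based semantic metric computes (a generic cosine-similarity sketch, not any particular metric from the paper; the model name and sentence pair are placeholders):

```python
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

# Placeholder model and sentence pair; a real metric would be validated
# against human judgments rather than used off the shelf like this.
model = SentenceTransformer("all-MiniLM-L6-v2")

hypothesis = "The cat is sleeping on the sofa."
reference = "A cat sleeps on the couch."

embeddings = model.encode([hypothesis, reference], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Embedding-based semantic similarity: {similarity:.3f}")
```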
We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
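A minimal way to probe this behavior yourself (a generic sketch using the Hugging Face CLIP API; the image path and captions are hypothetical, and this is not the benchmark's protocol):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_chasing_cat.jpg")  # placeholder image path
captions = [
    "a dog chasing a cat",  # matches the (hypothetical) image
    "a cat chasing a dog",  # same words, different composition
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape (1, 2)
print(logits_per_image.softmax(dim=-1))
# A bag-of-words-like encoder assigns similar scores to both captions.
```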
We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry containing 30,664 poems in total, of which 10,276 are human-written and 20,388 are generated by four popular LLMs.
Automatic evaluation of ST systems is typically performed by comparing translation hypotheses with one or more reference translations.
Our experiments, carried out on two ST benchmarks covering 79 language pairs and six ST systems with diverse architectures and performance levels, show that ASR transcripts constitute a more reliable synthetic source than back-translations…
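For context, the reference-based scoring mentioned above boils down to something like the following (a sketch with sacreBLEU; the toy hypotheses and references are made up):

```python
import sacrebleu  # pip install sacrebleu

# Toy system outputs and one set of reference translations (illustrative only).
hypotheses = ["the cat sat on the mat", "he went to the market yesterday"]
references = [["the cat is sitting on the mat", "she went to the market yesterday"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")
```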
DELULU significantly outperforms prior SSL models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks…
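For readers unfamiliar with the metric, equal error rate is the verification operating point where false accepts equal false rejects; a generic computation (toy scores, not DELULU's pipeline) looks like this:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: operating point where false-accept rate equals false-reject rate."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

# Toy verification trials: 1 = same speaker, 0 = different speaker.
labels = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.92, 0.81, 0.40, 0.65, 0.55, 0.20])
print(f"EER: {equal_error_rate(labels, scores):.3f}")
```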
We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts.
Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool to support…
In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from rotation-transformed structured and free-form English OCR datasets, and (ii)…
This paper proposes Group-relative Implicit Fine-Tuning (GIFT), a reinforcement learning framework for aligning large language models (LLMs) that unifies on-policy optimization with implicit preference learning.
Results show that GIFT converges faster, generalizes better with reduced overfitting, and outperforms GRPO on mathematical reasoning benchmarks (GSM8K, MATH, AIME) as well as on generation benchmarks (AlpacaEval and Arena-Hard).
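The paper's exact objective is not reproduced in this digest; as background, the group-relative idea that GRPO-style baselines rely on (and that GIFT is compared against) normalizes each sampled completion's reward within its prompt group, roughly as sketched below (shapes and numbers are illustrative):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within each prompt's group of sampled completions.

    rewards: (num_prompts, group_size) scalar rewards.
    Returns (reward - group_mean) / (group_std + eps), same shape.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[1.0, 0.0, 0.5, 0.0],
                        [0.2, 0.9, 0.9, 0.1]])
print(group_relative_advantages(rewards))
```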
To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality.
To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment.
Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge.
In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g.
We show that PoSh achieves stronger correlations (+0.05 Spearman ρ) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable…
Aligning large language models (LLMs) with human values is crucial for safe deployment.
On the HH-RLHF benchmark, we demonstrate that STARS achieves alignment quality competitive with state-of-the-art dynamic methods, while strictly bounding rejection costs and maximizing system throughput.
Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English, classification-style tasks.
We introduce a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings.
Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and increasingly serve as selection criteria in data filtering and candidate reranking.
Through a systematic study of top-performing learned and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation…
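A quick way to check for this kind of bias on your own QE outputs is to correlate segment length with the predicted score (a sketch with hypothetical numbers; the study itself is far more systematic):

```python
from scipy.stats import spearmanr

# Hypothetical per-segment QE scores and translation lengths (in tokens).
qe_scores = [0.91, 0.85, 0.72, 0.64, 0.58, 0.40]
lengths = [8, 15, 24, 31, 45, 60]

rho, p_value = spearmanr(lengths, qe_scores)
print(f"Spearman rho between length and QE score: {rho:.2f} (p={p_value:.3f})")
# A strongly negative rho is consistent with the reported bias:
# longer translations receive systematically worse predicted quality.
```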
However, most existing frameworks treat evaluation as a black box, relying solely on outcome scores without explaining why prompts succeed or fail.
In this paper, we propose MA-SAPO: a new Multi-Agent Reasoning for Score Aware Prompt Optimization framework that links evaluation outcomes directly to targeted refinements.
Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration.
We address this gap by extending TeXRA with an independent Lean 4 verification layer, turning it into a human-guided multi-agent platform for exact scientific discovery.
Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines.
Human evaluation confirms that VISTA's decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks.
However, limited work evaluates LLMs as agents in reinforcement learning (RL) tasks (e.g., playing games), where learning occurs through interaction with an environment and a reward system.
We show that it can match or outperform traditional RL agents in simple environments by leveraging pretrained knowledge.
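The interaction loop in question is the standard RL one, with the policy replaced by an LLM call; a minimal sketch (the environment choice is arbitrary and the LLM policy is stubbed out) looks like this:

```python
import gymnasium as gym  # pip install gymnasium

def llm_policy(observation, legal_actions):
    """Stub: a real agent would prompt an LLM with the observation
    (and any pretrained task knowledge) and parse an action from the reply."""
    return legal_actions[0]

env = gym.make("FrozenLake-v1", is_slippery=False)
obs, info = env.reset(seed=0)
episode_return = 0.0
for _ in range(50):
    action = llm_policy(obs, list(range(env.action_space.n)))
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    if terminated or truncated:
        break
print(f"Episode return: {episode_return}")
```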