
Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 411 · Search mode: keyword



How to Evaluate Speech Translation with Source-Aware Neural MT Metrics

Mauro Cettolo, Marco Gaido, Matteo Negri, Sara Papi, Luisa Bentivogli · Nov 5, 2025

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 75% · Moderate protocol signal · Freshness: Cold · Status: Ready
Automatic Metrics · Multilingual
  • Automatic evaluation of ST systems is typically performed by comparing translation hypotheses with one or more reference translations.
  • Our experiments, carried out on two ST benchmarks covering 79 language pairs and six ST systems with diverse architectures and performance levels, show that ASR transcripts constitute a more reliable synthetic source than back-translations…
Open paper
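The protocol named here, scoring a hypothesis against a reference while conditioning on the source, can be sketched with the open-source COMET library. A minimal sketch, assuming the `unbabel-comet` package and a common public checkpoint (not necessarily this paper's setup):

    # Score one ST hypothesis with a source-aware neural MT metric (COMET).
    # In the paper's setting, an ASR transcript could stand in for "src".
    from comet import download_model, load_from_checkpoint

    model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
    samples = [{
        "src": "Der Hund bellt.",     # source text (or ASR transcript)
        "mt": "The dog is barking.",  # system hypothesis
        "ref": "The dog barks.",      # human reference translation
    }]
    out = model.predict(samples, batch_size=8, gpus=0)  # gpus=0 -> CPU
    print(out.system_score)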

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 75% · Moderate protocol signal · Freshness: Cold · Status: Ready
Automatic Metrics · General
  • DELULU significantly outperforms prior SSL models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks…
Open paper

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 71% · Sparse protocol signal · Freshness: Cold · Status: Ready
Simulation Env · General
  • We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts.
  • Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool to support…
Open paper
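For triage purposes, the shape of such a simulated-user loop is sketched below; every name is illustrative, as SAGE's real interfaces are not shown in this feed:

    # Minimal simulated-user loop for multi-turn agent evaluation (illustrative
    # only; not SAGE's implementation). call_llm is a stand-in for a real client.
    def call_llm(system_prompt: str, history: list) -> str:
        return "[DONE]"  # replace with an actual LLM request

    def simulate_dialogue(agent_prompt: str, user_profile: str, max_turns: int = 8) -> list:
        history = []
        for _ in range(max_turns):
            # Simulated user conditioned on a business-context persona.
            user_msg = call_llm(f"Act as this customer: {user_profile}", history)
            history.append({"role": "user", "content": user_msg})
            if "[DONE]" in user_msg:  # simulator signals its goal is resolved
                break
            # The agent under evaluation responds; errors are logged per turn.
            history.append({"role": "assistant", "content": call_llm(agent_prompt, history)})
        return history

    print(simulate_dialogue("You are a support agent.", "wants a refund"))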
Seeing Straight: Document Orientation Detection for Efficient OCR

Suranjan Goswami, Abhinav Ravi, Raja Kolla, Ali Faraz, Shaharukh Khan, Akash · Nov 6, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 56% · Moderate protocol signal · Freshness: Cold · Status: Ready
Automatic Metrics · Multilingual
  • In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from rotation-transformed structured and free-form English OCR datasets, and (ii)…
Open paper

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 56% · High protocol signal · Freshness: Cold · Status: Ready
Pairwise Preference · Automatic Metrics · Math
  • This paper proposes Group-relative Implicit Fine-Tuning (GIFT), a reinforcement learning framework for aligning large language models (LLMs) that unifies on-policy optimization with implicit preference learning.
  • Results show that GIFT converges faster, generalizes better with reduced overfitting, and outperforms GRPO on mathematical reasoning benchmarks (GSM8K, MATH, AIME) as well as on generation benchmarks (AlpacaEval and Arena-Hard).
Open paper
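The excerpt does not spell out GIFT's update, but the group-relative ingredient it shares with GRPO is simple arithmetic: each sampled response's reward is normalized against its own group. A sketch of that baseline computation only, not the GIFT objective:

    # Group-relative advantages, GRPO-style: z-score rewards within the group
    # of responses sampled for one prompt.
    import numpy as np

    def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
        mu, sigma = rewards.mean(), rewards.std()
        return (rewards - mu) / (sigma + 1e-8)  # epsilon guards a zero-variance group

    print(group_relative_advantages(np.array([0.2, 0.9, 0.5, 0.1])))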
VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations

Yupeng Xie, Zhiyang Zhang, Yifan Wu, Sirong Lu, Jiayi Zhang, Zhaoyang Yu · Oct 25, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 56% · Moderate protocol signal · Freshness: Cold · Status: Ready
Automatic Metrics · General
  • To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality.
  • To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment.
Open paper

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 56% · Moderate protocol signal · Freshness: Cold · Status: Ready
Automatic Metrics · Medicine · Coding
  • Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge.
Open paper
PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut · Oct 21, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 56% · High protocol signal · Freshness: Cold · Status: Ready
Rubric Rating · Human Eval · Llm As Judge · General
  • In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. …
  • We show that PoSh achieves stronger correlations (+0.05 Spearman ρ) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable…
Open paper
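The underlying pattern, a scene graph unrolled into a per-node rubric for the judge, looks roughly like the sketch below; the graph schema and question templates are invented for illustration and are not PoSh's:

    # Turn a toy scene graph into rubric questions for an LLM judge.
    scene_graph = {
        "objects": ["dog", "frisbee"],
        "attributes": {"dog": ["brown"], "frisbee": ["red"]},
        "relations": [("dog", "catches", "frisbee")],
    }

    def rubric_questions(graph: dict) -> list:
        qs = [f"Does the description mention the {o}?" for o in graph["objects"]]
        qs += [f"Is the {o} described as {a}?"
               for o, attrs in graph["attributes"].items() for a in attrs]
        qs += [f"Does it state that the {s} {r} the {o}?"
               for s, r, o in graph["relations"]]
        return qs

    for q in rubric_questions(scene_graph):
        print(q)  # each question is posed to the judge; node scores are aggregated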
STARS: Synchronous Token Alignment for Robust Supervision in Large Language Models

Mohammad Atif Quamar, Mohammad Areeb, Mikhail Kuznetsov, Muslum Ozgur Ozmen, Z. Berkay Celik · Nov 5, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 52% · Moderate protocol signal · Freshness: Cold · Status: Ready
Automatic Metrics · Coding
  • Aligning large language models (LLMs) with human values is crucial for safe deployment.
  • On the HH-RLHF benchmark, we demonstrate that STARS achieves competitive alignment quality with that of state-of-the-art dynamic methods, while strictly bounding rejection costs and maximizing system throughput.
Open paper
Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs

Muhammed Saeed, Muhammad Abdul-mageed, Shady Shehata · Nov 3, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 52% · Moderate protocol signal · Freshness: Cold · Status: Ready
Automatic Metrics · Multilingual
  • Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English, classification-style tasks.
  • We introduce a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings.
Open paper
Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics

Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag · Oct 24, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 52% · Moderate protocol signal · Freshness: Cold · Status: Ready
Pairwise Preference · Llm As Judge · Multilingual
  • Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and increasingly serve as selection criteria in data filtering and candidate reranking.
  • Through a systematic study of top-performing learned and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation…
Open paper
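A cheap way to probe your own QE metric for this bias is to correlate its scores with hypothesis length on outputs of comparable quality. A sketch using SciPy, with toy data standing in for real metric outputs:

    # Length-bias probe: a strongly negative score-length correlation on
    # comparable-quality hypotheses is a warning sign.
    from scipy.stats import spearmanr

    def length_bias_probe(qe_scores: list, hypotheses: list):
        lengths = [len(h.split()) for h in hypotheses]
        return spearmanr(qe_scores, lengths)  # (rho, p-value)

    print(length_bias_probe([0.9, 0.7, 0.6],
                            ["a b", "a b c d", "a b c d e f"]))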
MA-SAPO: Multi-Agent Reasoning for Score-Aware Prompt Optimization

Wonduk Seo, Juhyeon Lee, Junseo Koh, Wonseok Choi, Hyunjin An, Jian Park · Oct 18, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 52% · Moderate protocol signal · Freshness: Cold · Status: Ready
Critique Edit · Multi Agent · General
  • However, most existing frameworks treat evaluation as a black box, relying solely on outcome scores without explaining why prompts succeed or fail.
  • In this paper, we propose MA-SAPO: a new Multi-Agent Reasoning for Score-Aware Prompt Optimization framework that links evaluation outcomes directly to targeted refinements.
Open paper
R-WoM: Retrieval-augmented World Model For Computer-use Agents

Kai Mei, Jiang Guo, Shuaichen Chang, Mingwen Dong, Dongkyu Lee, Xing Niu · Oct 13, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 56% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Simulation Env · Long Horizon · General
  • Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration.
Open paper
Qubit-centric Transformer for Surface Code Decoding

Seong-Joon Park, Hee-Youl Kwak, Yongjune Kim · Oct 13, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 46% · Sparse protocol signal · Freshness: Cold · Status: Ready
Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics

Qinxuan Wang, Chuang Wang, Mingyu Zhang, Jingwei Sun, Peipei Yang, Shuo Tang · Oct 17, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% · Moderate protocol signal · Freshness: Cold · Status: Ready
Automatic Metrics · General
  • We evaluate MNO on diverse benchmarks, covering steady-state and unsteady flow scenarios with up to 300k points.
Open paper

Match reason: Matched by broad semantic/index fallback.

Score: 26% · Sparse protocol signal · Freshness: Cold · Status: Ready
Multi Agent · Math · Coding
  • We address this gap by extending TeXRA with an independent Lean 4 verification layer, turning it into a human-guided multi-agent platform for exact scientific discovery.
Open paper
VISTA: Verification In Sequential Turn-based Assessment

Ashley Lewis, Andrew Perrault, Eric Fosler-Lussier, Michael White · Oct 30, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Human Eval · Llm As Judge · General
  • Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines.
  • Human evaluation confirms that VISTA's decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks.
Open paper
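The excerpt implies a decompose-then-verify protocol applied per turn. A generic sketch of that shape, with naive stand-ins where a real pipeline would make LLM or NLI calls:

    # Decompose a dialogue turn into claims, verify each against evidence,
    # and score the turn as the supported fraction (generic shape only).
    def decompose(turn: str) -> list:
        return [s.strip() for s in turn.split(".") if s.strip()]  # naive split

    def verify(claim: str, evidence: str) -> bool:
        return claim.lower() in evidence.lower()  # stand-in for entailment

    def turn_factuality(turn: str, evidence: str) -> float:
        claims = decompose(turn)
        return sum(verify(c, evidence) for c in claims) / len(claims) if claims else 1.0

    print(turn_factuality("Paris is in France. It has 40M people.",
                          "paris is in france"))  # -> 0.5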
PARL: Prompt-based Agents for Reinforcement Learning

Yarik Menchaca Resendiz, Roman Klinger · Oct 24, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 23% · Sparse protocol signal · Freshness: Cold · Status: Ready
Math
  • However, limited work evaluates LLMs as agents in reinforcement learning (RL) tasks (e.g., playing games), where learning occurs through interaction with an environment and a reward system.
  • We show that it can match or outperform traditional RL agents in simple environments by leveraging pretrained knowledge.
Open paper
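The setup under evaluation is an LLM dropped into a standard RL interaction loop. A minimal Gymnasium-shaped sketch, with a random policy standing in for the prompted model:

    # Evaluate a (stub) prompt-based agent in a Gymnasium environment.
    import gymnasium as gym

    env = gym.make("FrozenLake-v1")

    def choose_action(observation, reward_history):
        # Replace with an LLM prompt over the observation and reward history;
        # a random policy keeps the sketch runnable.
        return env.action_space.sample()

    obs, info = env.reset(seed=0)
    rewards, done = [], False
    while not done:
        obs, reward, terminated, truncated, info = env.step(choose_action(obs, rewards))
        rewards.append(reward)
        done = terminated or truncated
    print("episode return:", sum(rewards))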
