A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
We propose MM-WebAgent, an agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry, containing a total of 30,664 poems: 10,276 human-written and 20,388 generated by four popular LLMs.
To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression-ranking objective to provide fine-grained evaluation of reasoning paths.
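The snippet doesn't spell out the loss, but a hybrid regression-ranking objective is commonly a weighted sum of an absolute-score regression term and a pairwise margin ranking term. A minimal PyTorch sketch, where `alpha`, `margin`, and the pairwise form are assumptions rather than details from the paper:

```python
import torch
import torch.nn.functional as F

def hybrid_regression_ranking_loss(scores, targets, margin=0.5, alpha=0.5):
    # Regression term: anchor predicted scores to absolute quality labels.
    reg_loss = F.mse_loss(scores, targets)

    # Ranking term: for every pair (i, j) with targets[i] > targets[j],
    # require scores[i] to exceed scores[j] by at least `margin`.
    diff_s = scores.unsqueeze(1) - scores.unsqueeze(0)    # [i, j] = s_i - s_j
    diff_t = targets.unsqueeze(1) - targets.unsqueeze(0)  # [i, j] = t_i - t_j
    pair_mask = (diff_t > 0).float()
    rank_loss = (F.relu(margin - diff_s) * pair_mask).sum() / pair_mask.sum().clamp(min=1)

    return alpha * reg_loss + (1 - alpha) * rank_loss

# Toy batch: predicted scores for four reasoning paths and their labels.
scores = torch.tensor([0.2, 0.9, 0.4, 0.7], requires_grad=True)
targets = torch.tensor([0.0, 1.0, 0.5, 0.5])
hybrid_regression_ranking_loss(scores, targets).backward()
```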
Recent developments in large language models (LLMs) have facilitated autonomous AI agents capable of imitating human-generated content, raising fundamental questions about how AI may reshape democratic information environments such as news.
Varying imitation strategies and AI prevalence across information environments with different baseline structures, we show that the effects of AI-driven imitation are strongly context-dependent: imitating AI agents increase semantic…
Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-11% improvements across MATH500, AIME24, and GPQA_diamond benchmarks.
Through comprehensive experiments on multiple mathematical reasoning datasets, including MathInstruct and MetaMathQA, we demonstrate that models trained on MathFimer-expanded data consistently outperform their counterparts trained on…
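Judging from the name, MathFimer likely expands chain-of-thought data in a fill-in-the-middle style; the method details are not in this snippet, so the sketch below is purely illustrative (the prompt template and step-splitting scheme are assumptions):

```python
def make_fim_expansion_prompt(question, steps, gap_index):
    # Hold out the step at `gap_index` and ask a model to regenerate
    # (and elaborate) the missing reasoning between prefix and suffix.
    prefix = "\n".join(steps[:gap_index])
    suffix = "\n".join(steps[gap_index + 1:])
    return (
        f"Problem: {question}\n"
        f"Partial solution (before the gap):\n{prefix}\n"
        f"Partial solution (after the gap):\n{suffix}\n"
        "Fill in the missing intermediate steps:"
    )

steps = [
    "Let x be the unknown number.",
    "2x + 6 = 14, so 2x = 8.",
    "Therefore x = 4.",
]
print(make_fim_expansion_prompt("Twice a number plus 6 is 14. Find it.", steps, 1))
```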
Training a 1B-parameter Llama model for 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15% of the training tokens, while also improving across other benchmarks and mitigating the curse of…
With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks.
However, current work typically relies on prompt design or fine-tuning strategies applied to vanilla LLMs, which often leads to limited effectiveness or suboptimal performance in complex agent-related environments.
We introduce natural language unit tests, a paradigm that decomposes response quality into explicit, testable criteria, along with a unified scoring model, LMUnit, which combines multi-objective training across preferences, direct ratings,…
Through controlled human studies, we show this paradigm significantly improves inter-annotator agreement and enables more effective LLM development workflows.
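As a rough picture of the paradigm (not LMUnit's actual interface), a response's quality score under natural language unit tests might be the pass rate over explicit criteria, with a scoring model acting as the judge; the criteria and toy judge below are illustrative:

```python
from typing import Callable

# Hypothetical unit tests for one prompt; real criteria would be written
# per task, and the judge would be a scoring model such as LMUnit.
UNIT_TESTS = [
    "The response states its key assumption explicitly.",
    "The response avoids unsupported numerical claims.",
    "The response directly answers the question asked.",
]

def score_response(response: str, judge: Callable[[str, str], bool],
                   tests: list[str] = UNIT_TESTS) -> float:
    # Decompose quality into per-criterion pass/fail checks, then
    # report the fraction of criteria satisfied.
    return sum(judge(c, response) for c in tests) / len(tests)

# Stand-in judge for the sketch only; a real judge is a model call.
toy_judge = lambda criterion, response: "assume" in response.lower()
print(score_response("We assume linear growth, so the answer is 42.", toy_judge))
```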
In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning.
We find that, across multiple benchmark datasets, coupled autoregressive generation requires up to 75% fewer samples to reach the same conclusions as vanilla autoregressive generation.
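Coupled generation here refers to sharing the randomness across the models being compared, in the spirit of common random numbers; a minimal sketch of one coupled next-token draw (the paper's exact coupling scheme may differ):

```python
import numpy as np

def sample_coupled(p_a, p_b, rng):
    # Draw a single uniform u and invert each model's token CDF with the
    # *same* u, so both models' next-token samples share randomness and
    # their outputs can be compared with lower variance.
    u = rng.uniform()
    tok_a = np.searchsorted(np.cumsum(p_a), u)
    tok_b = np.searchsorted(np.cumsum(p_b), u)
    return tok_a, tok_b

rng = np.random.default_rng(0)
p_a = np.array([0.6, 0.3, 0.1])  # model A's next-token distribution
p_b = np.array([0.5, 0.3, 0.2])  # model B's next-token distribution
print(sample_coupled(p_a, p_b, rng))
```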
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development.
While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm.
However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities.
In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage.
The framework is evaluated on four medical imaging benchmark datasets: chest X-rays for COVID-19, tuberculosis, and pneumonia, plus retinal Optical Coherence Tomography (OCT) images.
This work introduces a toolchain for applying Reinforcement Learning (RL), specifically the Deep Deterministic Policy Gradient (DDPG) algorithm, in safety-critical real-world environments.
RL provides a viable solution; however, safety concerns, such as excessive pressure rise rates, must be addressed when applying it to HCCI engines.
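The toolchain's safety mechanism isn't described in this snippet; one common pattern is a safety layer that attenuates or overrides unsafe actions before they reach the plant. A sketch under assumed names (`prr_model`, the `prr_limit` threshold, and the shrink schedule are all hypothetical):

```python
import numpy as np

def apply_safety_filter(action, prr_model, prr_limit=5.0, shrink=0.8, max_iters=10):
    # If the predicted pressure rise rate (PRR) for `action` exceeds the
    # limit, shrink the action toward a neutral setpoint (zero here)
    # until the prediction is safe; give up to a fully neutral action.
    a = np.asarray(action, dtype=float)
    for _ in range(max_iters):
        if prr_model(a) <= prr_limit:
            return a
        a = shrink * a
    return np.zeros_like(a)

# Toy PRR model for demonstration: PRR grows with action magnitude.
toy_prr = lambda a: 3.0 + 4.0 * float(np.linalg.norm(a))
print(apply_safety_filter([1.0, -0.5], toy_prr))
```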
Due to the absence of standardized benchmarks for this specific task, we conduct LLM-as-a-judge and human-as-a-judge evaluations to assess accuracy across the three applications.
Natural Language Processing (NLP) is becoming a dominant subset of artificial intelligence as the need to help machines understand human language proves indispensable.
Intent, a critical cognitive notion and mental state, is ubiquitous in human communication and problem-solving.
Comprehensive experiments across diverse benchmarks, model types, and configurations demonstrate the effectiveness, robustness, and generalizability of IA.
In humans, this is evident in the ease with which cognate words - words similar in both orthographic form and meaning (e.g., blind, meaning "sightless" in both English and German) - are processed, compared to the challenges posed by…