A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry containing 30,664 poems in total: 10,276 are human-written and 20,388 are generated by four popular LLMs.
To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression–ranking objective to provide fine-grained evaluation of reasoning paths.
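For quick triage, here is a minimal sketch of what a hybrid regression–ranking objective can look like; the loss weighting, margin, and variable names are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_regression_ranking_loss(scores_pos, scores_neg, targets_pos, targets_neg,
                                   margin=0.5, alpha=0.5):
    """Illustrative hybrid objective (hypothetical hyperparameters):
    - regression term pulls predicted scores toward gold quality labels
    - ranking term keeps preferred reasoning paths above dispreferred ones by a margin
    """
    # Pointwise regression on both members of each preference pair
    reg = F.mse_loss(scores_pos, targets_pos) + F.mse_loss(scores_neg, targets_neg)
    # Pairwise margin ranking: scores_pos should exceed scores_neg
    rank = F.margin_ranking_loss(scores_pos, scores_neg,
                                 target=torch.ones_like(scores_pos), margin=margin)
    return alpha * reg + (1 - alpha) * rank
```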
To more comprehensively evaluate agents in scientific research settings, we introduce FML-bench, a benchmark comprising 8 diverse and fundamental ML research tasks, and further propose complementary metrics, notably Exploration Diversity,…
We evaluate state-of-the-art research agents on FML-bench, showing that agents employing broad exploration strategies exhibit higher exploration diversity and achieve superior performance, and that exploration diversity positively…
We validate their role by showing that suppressing them substantially degrades performance on cultural benchmarks (by up to 30%), while performance on general natural language understanding (NLU) benchmarks remains largely unaffected.
Finally, we demonstrate that training on NLU benchmarks can diminish models' cultural understanding when we update modules containing many culture-general neurons.
In blind pairwise evaluations by 28 MFA-trained readers and 516 college-educated general readers, AI text from in-context prompting was strongly disfavored by MFA readers for stylistic fidelity (OR=0.16) and quality (OR=0.13), while general…
Fine-tuning ChatGPT on authors' complete works reversed these results: MFA readers favored AI for fidelity (OR=8.16) and quality (OR=1.87), with general readers showing even stronger preference (fidelity OR=16.65; quality OR=5.42).
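One way to read these effect sizes, under the interpretive assumption that each odds ratio acts as the odds of a reader choosing the AI text in a matched pair (the paper's exact logistic model is not stated above), is to convert to an implied preference probability:

```latex
p = \frac{\mathrm{OR}}{1+\mathrm{OR}}, \qquad
\mathrm{OR}=0.16 \;\Rightarrow\; p \approx 0.14, \qquad
\mathrm{OR}=8.16 \;\Rightarrow\; p \approx 0.89 .
```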
Human communication heavily relies on laconism and inferential pragmatics, allowing listeners to successfully reconstruct rich meaning from sparse, telegraphic speech.
To study this setting, we introduce PluriHopWIND, a multilingual diagnostic benchmark of 48 pluri-hop questions over 191 real wind-industry reports, with high repetitiveness to reflect the challenge of distractors in real-world datasets.
We test PluriHopRAG on PluriHopWIND and the Loong benchmark built on financial, legal and scientific reports.
The backdoor remains robust to moderate trigger variations and transfers across evaluation suites (96.27%, 99.09%), whereas a text-only variant largely fails (0.72%).
These results reveal that VLA models can be covertly steered at the granularity of safety-critical actions with minimal poisoning and without observable degradation of nominal performance.
We present MOSAIC, a multi-agent Large Language Model (LLM) framework for solving challenging scientific coding tasks.
We evaluate MOSAIC on scientific coding benchmarks and demonstrate that our specialized agentic framework outperforms existing approaches in terms of accuracy, robustness, and interpretability.
We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts.
Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool to support…
In this paper, we introduce \dataset, the first benchmark for evaluating detector robustness in personalized settings, built from literary and blog texts paired with their LLM-generated imitations.
Our extensive evaluations on diverse benchmarks demonstrate that Search-R3 significantly outperforms prior methods by unifying the reasoning and embedding generation processes.
We introduce a meta-evaluation measure for micro-benchmarking that quantifies how reliably a micro-benchmark ranks a pair of models as a function of their performance difference on the full benchmark.
We show that consistently ranking model pairs with relatively similar performance often requires as many as 250 examples, at which point random sampling is competitive with existing micro-benchmarking methods.
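A minimal sketch of this kind of meta-evaluation (function and variable names are my own; the paper's exact estimator may differ): for each model pair, check how often a randomly sampled micro-benchmark reproduces the full-benchmark ranking, reported against the pair's full-benchmark score gap.

```python
import numpy as np

def rank_agreement_by_gap(full_scores, per_example_correct, subset_size=250,
                          n_trials=100, rng=None):
    """Estimate how often a random micro-benchmark of `subset_size` examples
    ranks each model pair the same way as the full benchmark.

    full_scores:          dict model -> accuracy on the full benchmark
    per_example_correct:  dict model -> np.array of 0/1 per-example correctness
    Returns a list of (full-benchmark gap, agreement rate) per model pair.
    """
    rng = rng or np.random.default_rng(0)
    models = list(full_scores)
    n_examples = len(next(iter(per_example_correct.values())))
    results = []
    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            a, b = models[i], models[j]
            gap = abs(full_scores[a] - full_scores[b])
            agree = 0
            for _ in range(n_trials):
                idx = rng.choice(n_examples, size=subset_size, replace=False)
                sub_a = per_example_correct[a][idx].mean()
                sub_b = per_example_correct[b][idx].mean()
                # Agreement: the subsample orders the pair like the full benchmark
                agree += (sub_a > sub_b) == (full_scores[a] > full_scores[b])
            results.append((gap, agree / n_trials))
    return results
```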
Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech…
We call for novel practices and evaluation frameworks that leverage rather than constrain the open-ended generative diversity of LLMs, creating synergies between NLP and social science.
Prior benchmarks mainly focus on semantic understanding but overlook systematic evaluation of navigation agents' spatial perception and reasoning capabilities.
In this work, we introduce the NavSpace benchmark, which contains six task categories and 1,228 trajectory-instruction pairs designed to probe the spatial intelligence of navigation agents.
In contrast, we propose a framework for assessing contamination in tabular datasets by generating controlled queries and performing comparative evaluation.
These findings suggest that performance on downstream tasks involving such datasets may be substantially inflated, raising concerns about the reliability of current evaluation practices.
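As an illustrative instance of such a controlled-query probe (not the paper's specified protocol; the prompt format, helper names, and comparison design below are assumptions), one can ask a model to fill in a masked cell for real rows and for lightly perturbed control rows, then compare exact-match rates; a large gap hints at memorization of the original table.

```python
import random

def contamination_probe(rows, target_col, perturb, query_model, n=200, seed=0):
    """Compare exact-match completion rates on real rows vs. perturbed control rows.
    `rows` is a list of dicts, `perturb` maps a row to a plausible but unseen variant,
    and `query_model` sends a prompt string to the LLM and returns its answer.
    """
    rng = random.Random(seed)
    sample = rng.sample(rows, min(n, len(rows)))

    def ask(row):
        # Present all columns except the target and ask the model to recover it
        context = ", ".join(f"{k}={v}" for k, v in row.items() if k != target_col)
        prompt = f"Given the record ({context}), what is the value of '{target_col}'?"
        return query_model(prompt).strip() == str(row[target_col])

    real_acc = sum(ask(r) for r in sample) / len(sample)
    control_acc = sum(ask(perturb(r)) for r in sample) / len(sample)
    return real_acc, control_acc  # a large real-vs-control gap hints at contamination
```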
We evaluate each model on three downstream tasks -- named entity recognition (NER), part-of-speech tagging (POS) and natural language inference (NLI) -- and find that romanization significantly outperforms other input types in 11 out of 12…