Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 888 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,634) General (532) Long Horizon (320) Pairwise Preference (289) Coding (221) Simulation Env (190) Multi Agent (184) Medicine (117) Llm As Judge (109) Expert Verification (98) Human Eval (89) Rubric Rating (82) Web Browsing (79) Math (78) Demonstrations (67) Critique Edit (63)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Apr 16, 2026 · Citations: 0

We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Latent-Condensed Transformer for Efficient Long Context Modeling
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models
Apr 13, 2026 · Citations: 0

Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
Apr 13, 2026 · Citations: 0

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
Apr 13, 2026 · Citations: 0

Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies.
Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation
Apr 13, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Apr 13, 2026 · Citations: 0

We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry
Apr 11, 2026 · Citations: 0

To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human-written poems and 20,388 poems are generated by four popular LLMs.
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
Apr 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
Apr 11, 2026 · Citations: 0

To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression--ranking objective to provide fine-grained evaluation of reasoning paths.
MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits
Apr 10, 2026 · Citations: 0

On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

Humanity's Last Exam

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang · Jan 24, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 78% High protocol signal Freshness: Cold Status: Ready

Automatic Metrics Math

However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities.
In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage.

Open paper

The Limits of Inference Scaling Through Resampling

Benedikt Stroebl, Sayash Kapoor, Arvind Narayanan · Nov 26, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 78% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics Coding

Our analysis shows that there is a strong correlation between the model's single-sample accuracy and its false positive rate on HumanEval and MBPP, whose unit tests have limited coverage.

Open paper

EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

Shih-Yang Liu, Maksim Khadkevich, Nai Chit Fung, Charbel Sakr, Chao-Han Huck Yang, Chien-Yi Wang · Oct 28, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 78% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics MathCoding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Multi-Head RAG: Solving Multi-Aspect Problems with LLMs

Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli, Patrik Okanovic · Jun 7, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 78% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

MRAG integrates seamlessly with existing RAG frameworks and benchmarks.

Open paper

Augmenting Lateral Thinking in Language Models with Humor and Riddle Data for the BRAINTEASER Task

Mina Ghashami, Soumya Smruti Mishra · May 16, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 78% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

LLMs as Repositories of Factual Knowledge: Limitations and Solutions

Seyed Mahed Mousavi, Simone Alghisi, Giuseppe Riccardi · Jan 22, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Query pipeline optimization for cancer patient question answering systems

Maolin He, Rena Gao, Mike Conway, Brian E. Chapman · Dec 19, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics Medicine

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

From XAI to MLOps: Explainable Concept Drift Detection with Profile Drift Detection

Ugur Dar, Mustafa Cavus · Dec 15, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Score-matching-based Structure Learning for Temporal Data on Networks

Hao Chen, Kai Yi, Yu Guang Wang · Dec 10, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

Among them, the score-matching method has demonstrated superior performance across various evaluation metrics, particularly for the commonly encountered Additive Nonlinear Causal Models.

Open paper

Predicting Subway Passenger Flows under Incident Situation with Causality

Xiannan Huang, Shuhan Qiu, Quan Yuan, Chao Yang · Dec 9, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Deconfounded Time Series Forecasting: A Causal Inference Approach

Wentao Gao, Xiaojing Du, Wenjun Yu, Xiongren Chen, Yifan Guo, Feiyu Yang · Oct 27, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Code Comprehension then Auditing for Unsupervised LLM Evaluation

Bhrij Patel, Souradip Chakraborty, Mengdi Wang, Dinesh Manocha, Amrit Singh Bedi · Oct 4, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics Coding

Large Language Models (LLMs) for unsupervised code correctness evaluation have recently gained attention because they can judge if code runs as intended without requiring reference implementations or unit tests, which may be unavailable,…
By sequentially sampling comprehension before evaluation, CoCoA improves the quality of inferred program behavior and enables the evaluator to focus on behavioral alignment rather than raw implementation details.

Open paper

Abstracted Gaussian Prototypes for True One-Shot Concept Learning

Chelsea Zou, Kenneth J. Kurtz · Aug 30, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

Results from human judges reveal that the generative pipeline produces novel examples and classes of visual concepts that are broadly indistinguishable from those made by humans.

Open paper

EventChat: Implementation and user-centric evaluation of a large language model-driven conversational recommender system for exploring leisure events in an SME context

Hannes Kunstmann, Joseph Ollier, Joel Persson, Florian von Wangenheim · Jul 5, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

Yet to date, research has predominantly focused upon technical frameworks to implement LLM-driven CRS, rather than end-user evaluations or strategic implications for firms, particularly from the perspective of a small to medium enterprises…
In the current paper, we detail the design of an LLM-driven CRS in an SME setting, and its subsequent performance in the field using both objective system metrics and subjective user evaluations.

Open paper

Analyzing Language Bias Between French and English in Conventional Multilingual Sentiment Analysis Models

Ethan Parker Wong, Faten M'hiri · May 7, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics Multilingual

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models

Qiong Wu, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji · Mar 22, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

To validate EAS, we apply it to a recently proposed MLLM called LaVIN and a classic VL pre-trained model called METER, and conduct extensive experiments on a set of benchmarks.

Open paper

Toward Adaptive Large Language Models Structured Pruning via Hybrid-grained Weight Importance Assessment

Jun Liu, Zhenglun Kong, Pu Zhao, Changdi Yang, Hao Tang, Xuan Shen · Mar 16, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics Coding

Building on this insight, we introduce the Hybrid-grained Weight Importance Assessment (HyWIA), a novel method that merges fine-grained and coarse-grained evaluations of weight importance for the pruning of LLMs.
Extensive experiments on LLaMA-V1/V2, Vicuna, Baichuan, and Bloom across various benchmarks demonstrate the effectiveness of HyWIA in pruning LLMs.

Open paper

SimSiam Naming Game: A Unified Approach for Representation Learning and Emergent Communication

Nguyen Le Hoang, Tadahiro Taniguchi, Fang Tianwei, Akira Taniguchi · Oct 29, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 78% Moderate protocol signal Freshness: Cold Status: Fallback

Automatic Metrics Multi Agent General

Emergent Communication (EmCom) investigates how agents develop symbolic communication through interaction without predefined language.
In this work, we propose the SimSiam Naming Game (SSNG), a feedback-free EmCom framework that replaces sampling-based updates with a symmetric, self-supervised representation alignment objective between autonomous agents.

Open paper

RouteNet-Gauss: Hardware-Enhanced Network Modeling with Machine Learning

Carlos Güemes-Palau, Miquel Ferriol-Galmés, Jordi Paillisse-Vilanova, Albert López-Brescó, Pere Barlet-Ros, Albert Cabellos-Aparicio · Jan 15, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Fallback

Automatic MetricsSimulation Env General

Open paper

Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms

Joshua Ashkinaze, Ruijia Guan, Laura Kurek, Eytan Adar, Ceren Budak, Eric Gilbert · Jul 4, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.