Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 289 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,634) General (532) Long Horizon (320) Pairwise Preference (289) Coding (221) Simulation Env (190) Multi Agent (184) Medicine (117) Llm As Judge (109) Expert Verification (98) Human Eval (89) Rubric Rating (82) Web Browsing (79) Math (78) Demonstrations (67) Critique Edit (63)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Apr 16, 2026 · Citations: 0

We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Latent-Condensed Transformer for Efficient Long Context Modeling
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models
Apr 13, 2026 · Citations: 0

Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
Apr 13, 2026 · Citations: 0

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
Apr 13, 2026 · Citations: 0

Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies.
Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation
Apr 13, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Apr 13, 2026 · Citations: 0

We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry
Apr 11, 2026 · Citations: 0

To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human-written poems and 20,388 poems are generated by four popular LLMs.
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
Apr 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
Apr 11, 2026 · Citations: 0

To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression--ranking objective to provide fine-grained evaluation of reasoning paths.
MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits
Apr 10, 2026 · Citations: 0

On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

HyperMem: Hypergraph Memory for Long-Term Conversations

Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang, Tingwen Liu · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise Preference Llm As JudgeAutomatic Metrics General

Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues.
Experiments on the LoCoMo benchmark show that HyperMem achieves state-of-the-art performance with 92.73% LLM-as-a-judge accuracy, demonstrating the effectiveness of HyperMem for long-term conversations.

Open paper

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 65% High protocol signal Freshness: Hot Status: Ready

Pairwise PreferenceRubric Rating Human EvalAutomatic Metrics General

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences.

Open paper

Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

José Pombal, Ricardo Rei, André F. T. Martins · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise PreferenceRubric Rating Llm As Judge Medicine

We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings.
Using IFEval, a benchmark with programmatically verifiable rubrics, we show that SPB persists even when evaluation criteria are entirely objective: among rubrics where generators fail, judges can be up to 50\% more likely to incorrectly…

Open paper

MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang, Chao Feng · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise Preference Automatic Metrics General

Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.

Open paper

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise Preference Automatic Metrics Law

Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation.

Open paper

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise Preference Llm As JudgeAutomatic Metrics MedicineMultilingual

A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%).

Open paper

Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu, Xinyu Dai · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise Preference Automatic Metrics General

However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process.
To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality.

Open paper

PLOT: Enhancing Preference Learning via Optimal Transport

Liang Zhu, Yuelin Bai, Xiankun Ren, Jiaxi Yang, Lei Zhang, Feiteng Fang · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise Preference Automatic Metrics General

Preference learning in Large Language Models (LLMs) has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global…
We introduce PLOT, which enhances Preference Learning in fine-tuning-based alignment through a token-level loss derived from Optimal Transport.

Open paper

Self-Debias: Self-correcting for Debiasing Large Language Models

Xuan Feng, Shuai Zhao, Luwei Xiao, Tianlong Gu, Bo An · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise Preference Long Horizon General

Unlike standard preference optimization which applies broad penalties, Self-Debias employs a fine-grained trajectory-level objective subject to dynamic debiasing constraints.

Open paper

Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

Shiwan Zhao, Zhihu Wang, Xuyang Zhao, Jiaming Zhou, Caiyue Xu, Chenfei Liu · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise Preference Long Horizon General

Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines.
SFT may serve either support expansion or policy reshaping, whereas preference-based methods are usually off-policy reshaping.

Open paper

TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation

Xinliang Frederick Zhang, Lu Wang · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise Preference Long Horizon General

Personalized large language models (PLLMs) have garnered significant attention for their ability to align outputs with individual's needs and preferences.
Extensive evaluations on long-horizon benchmarks using the Qwen-3 model family (4B to 32B) validate the effectiveness of TSUBASA, surpassing competitive memory-augmented systems that rely primarily on memory writing, such as Mem0 and…

Open paper

Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

Xin Sun, Di Wu, Sijing Qin, Isao Echizen, Abdallah El Ali, Saku Sugawara · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise Preference Llm As Judge General

Large language models (LLMs) are increasingly used as automated evaluators (LLM-as-a-Judge).
Using a counterfactual design, we find that both humans and LLM judges assign higher trust to information labeled as human-authored than to the same content labeled as AI-generated.

Open paper

Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation

Zhiyu Cao, Peifeng Li, Qiaoming Zhu · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 62% Moderate protocol signal Freshness: Hot Status: Fallback

Pairwise Preference General

Specifically, DRCR employs two complementary feedback signals, discourse coherence and response quality, to construct preference data for both context rewriting and response generation.

Open paper

Controlling Distributional Bias in Multi-Round LLM Generation via KL-Optimized Fine-Tuning

Yanbei Jiang, Amr Keleg, Ryandito Diandaru, Jey Han Lau, Lea Frermann, Biaoyan Fang · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 62% Moderate protocol signal Freshness: Hot Status: Fallback

Pairwise Preference General

Our empirical analysis reveals that off-the-shelf LLMs and standard alignment techniques, including prompt engineering and Direct Preference Optimization, fail to reliably control output distributions.

Open paper

JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

Aichen Cai, Anmeng Zhang, Anyu Li, Bo Zhang, Bohua Cai, Chang Li · Apr 3, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 62% Moderate protocol signal Freshness: Hot Status: Fallback

Pairwise Preference General

JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale…

Open paper

Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest

Addison J. Wu, Ryan Liu, Shuyue Stella Li, Yulia Tsvetkov, Thomas L. Griffiths · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback

Pairwise Preference General

Today's large language models (LLMs) are trained to align with user preferences through methods such as reinforcement learning.
We then present a suite of evaluations to examine how current models handle these tradeoffs.

Open paper

Multi-Faceted Self-Consistent Preference Alignment for Query Rewriting in Conversational Search

Zhiyu Cao, Peifeng Li, Qiaoming Zhu · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback

Pairwise Preference General

To address this issue, we propose Multi-Faceted Self-Consistent Preference Aligned CQR (MSPA-CQR).
Then we propose prefix guided multi-faceted direct preference optimization to learn preference information from three different dimensions.

Open paper

Evaluating Learner Representations for Differentiation Prior to Instructional Outcomes

Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback

Pairwise Preference General

We introduce distinctiveness, a representation-level measure that evaluates how each learner differs from others in the cohort using pairwise distances, without requiring clustering, labels, or task-specific evaluation.
Using student-authored questions collected through a conversational AI agent in an online learning environment, we compare representations based on individual questions with representations that aggregate patterns across a student's…

Open paper

Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not

Sercan Karakaş · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback

Pairwise Preference Multilingual

Large language models achieve strong performance on many language tasks, yet it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution.
In a speeded forced-choice comprehension experiment, humans show a large, correctly directed plausibility effect.

Open paper

DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment

Liang Zhu, Feiteng Fang, Yuelin Bai, Longze Chen, Zhexiang Zhang, Minghuan Tan · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Pairwise Preference).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback

Pairwise Preference General

Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable.
However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs.

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent