
Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 253


TimeWarp: Evaluating Web Agents by Revisiting the Past

Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026

Citations: 0

Match reason: Matches selected tags (General, Demonstrations).

Score: 62% Moderate protocol signal Freshness: Hot Status: Ready
Demonstrations Web Browsing General
  • The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?
  • We introduce TimeWarp, a benchmark that emulates the evolving web using containerized environments that vary in UI, design, and layout.
Open paper
VRM: Teaching Reward Models to Understand Authentic Human Preferences

Biao Liu, Ning Xu, Junming Yang, Hao Xu, Xin Geng · Mar 5, 2026

Citations: 0

Match reason: Matches selected tags (General).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Human Eval General
  • Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often encounter challenges of reward hacking, where the approaches predominantly rely on…
  • Motivated by this consideration, we propose VRM, i.e., Variational Reward Modeling, a novel framework that explicitly models the evaluation process of human preference judgments by incorporating both high-dimensional objective weights and…
Open paper
When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger

Amirabbas Afzali, Myeongho Jeon, Maria Brbic · Mar 5, 2026

Citations: 0

Match reason: Matches selected tags (General).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics General
  • Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization…
  • Notably, the model aligned by CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO.
Open paper
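The core idea in the CW-PO entry above — re-weighting preference pairs by a weak LLM's confidence before preference optimization — can be sketched on top of a standard DPO-style loss. This is an illustrative reconstruction, not the paper's implementation: the normalization scheme, the `beta` value, and the exact loss form are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cw_dpo_loss(policy_logratios, ref_logratios, confidences, beta=0.1):
    """Confidence-weighted DPO loss over a batch (illustrative sketch).

    Each element i carries:
      policy_logratios[i]: log pi(y_w|x) - log pi(y_l|x) under the policy
      ref_logratios[i]:    the same log-ratio under the frozen reference model
      confidences[i]:      a weak LLM's confidence in the preference label, in [0, 1]
    """
    total_conf = sum(confidences)
    loss = 0.0
    for p, r, c in zip(policy_logratios, ref_logratios, confidences):
        margin = beta * (p - r)                   # DPO implicit-reward margin
        per_sample = -math.log(sigmoid(margin))   # standard per-pair DPO loss term
        loss += (c / total_conf) * per_sample     # re-weight by weak-LLM confidence
    return loss
```

Pairs the weak judge is unsure about contribute less gradient, which is one plausible way to match full-annotation DPO with a fraction of the human labels.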
LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services

Jinwen Chen, Shuai Gong, Shiwen Zhang, Zheng Zhang, Yachao Zhao, Lingxiang Wang · Mar 5, 2026

Citations: 0

Match reason: Matches selected tags (General).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics General
  • While LLMs offer strong semantic generalization, deploying them in local-life services introduces three key challenges: lack of geographic grounding, exposure bias in preference optimization, and online inference latency.
  • Extensive offline evaluations and large-scale online A/B testing demonstrate that LocalSUG improves click-through rate (CTR) by +0.35% and reduces the low/no-result rate by 2.56%, validating its effectiveness in real-world deployment.
Open paper
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (General).

Score: 55% High protocol signal Freshness: Hot Status: Ready
Red Team Automatic Metrics Web Browsing General
  • Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs.
  • We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic…
Open paper
ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts

Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat · Mar 5, 2026

Citations: 0

Match reason: Matches selected tags (General).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback
Llm As Judge Automatic Metrics General
  • Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators.
  • Finally, we introduce the ThaiSafetyBench leaderboard to provide continuously updated safety evaluations and encourage community participation.
Open paper
AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning

Nikolas Karafyllis, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou · Mar 4, 2026

Citations: 0

Match reason: Matches selected tags (General).

Score: 55% High protocol signal Freshness: Hot Status: Fallback
Pairwise Preference Automatic Metrics General
  • We present a winning three-stage system for SemEval 2026 Task 12: Abductive Event Reasoning that combines graph-based retrieval, LLM-driven abductive reasoning with prompt design optimized through reflective prompt evolution, and post-hoc…
  • Cross-model error analysis across 14 models (7 families) reveals three shared inductive biases: causal chain incompleteness, proximate cause preference, and salience bias, whose cross-family convergence (51% cause-count reduction)…
Open paper
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright, Marcus Yearwood · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (General).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback
Pairwise Preference Rubric Rating Llm As Judge Simulation Env Long Horizon General
  • Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
  • We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations.
Open paper
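The rubric-decomposition pattern described in the entry above — structured per-dimension scores, a calibration step against human annotations, then aggregation — can be sketched as follows. The dimension names, weights, and single-offset calibration are hypothetical placeholders; the paper's actual rubric and calibration procedure are not detailed in this snippet.

```python
from dataclasses import dataclass

# Hypothetical rubric dimensions and weights (not taken from the paper).
RUBRIC_WEIGHTS = {
    "relevance": 0.4,
    "faithfulness": 0.4,
    "helpfulness": 0.2,
}

@dataclass
class JudgeScore:
    dimension: str
    raw: float  # LLM judge's rating on a 1-5 scale

def calibrate(raw: float, bias: float = 0.5) -> float:
    """Correct systematic judge bias against human annotations.

    `bias` stands in for an offset estimated on a held-out set of
    human-rated conversations; LLM judges often skew optimistic.
    """
    return max(1.0, min(5.0, raw - bias))

def aggregate(scores: list[JudgeScore]) -> float:
    """Weighted aggregate of calibrated per-dimension scores, in [1, 5]."""
    return sum(RUBRIC_WEIGHTS[s.dimension] * calibrate(s.raw) for s in scores)
```

Decomposing end-to-end quality this way makes disagreements with human raters attributable to a specific dimension rather than a single opaque score.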
Contextualized Privacy Defense for LLM Agents

Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie, Diyi Yang · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (General).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback
Simulation Env Long Horizon General
  • LLM agents increasingly act on users' personal information, yet existing privacy defenses remain limited in both design and adaptability.
  • These paradigms are insufficient for supporting contextual, proactive privacy decisions in multi-step agent execution.
Open paper
AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Cheng Jiayang, Dongyu Ru, Lin Qiu, Yiyang Li, Xuezhi Cao, Yangqiu Song · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (General).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback
Simulation Env Long Horizon General
  • Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory.
  • To address these gaps, we introduce AMemGym, an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization.
Open paper
LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval

Jiajie Jin, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Yutao Zhu · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (General).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • Extensive experiments on both in-domain and out-of-domain reasoning-intensive benchmarks demonstrate that LaSER significantly outperforms state-of-the-art baselines.
Open paper

Match reason: Matches selected tags (General).

Score: 52% Moderate protocol signal Freshness: Hot Status: Fallback
Simulation Env Multi Agent General
  • We report four preregistered studies (1,584 multi-agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models produce a structurally analogous phenomenon: surface…
  • Study 3 (N = 180) tested individuation as countermeasure; individuated agents became the primary source of both pathology and dissociation (DI = +1.120) with conformity above 84%--demonstrating iatrogenesis.
Open paper
Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility

Angana Borah, Zohaib Khan, Rada Mihalcea, Verónica Pérez-Rosas · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (General).

Score: 52% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Simulation Env General
  • As Large Language Models (LLMs) are increasingly used to simulate human behaviors, we investigate whether they can simulate demographic misinformation susceptibility, treating beliefs as a primary driving factor.
  • We study prompt-based conditioning and post-training adaptation, and conduct a multi-fold evaluation using: (i) susceptibility accuracy and (ii) counterfactual demographic sensitivity.
Open paper
Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning

Guilhem Fouilhé, Rebecca Eifler, Antonin Poché, Sylvie Thiébaux, Nicholas Asher · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (General).

Score: 52% Moderate protocol signal Freshness: Hot Status: Fallback
Pairwise Preference Multi Agent General
  • When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human's role is to guide the AI…
  • To enable natural interaction with such a system, we present a multi-agent Large Language Model (LLM) architecture that is agnostic to the explanation framework and enables user- and context-dependent interactive explanations.
Open paper

Match reason: Matches selected tags (General).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback
Rlaif Or Synthetic Feedback General
  • AI safety via debate and reinforcement learning from AI feedback (RLAIF) are both proposed methods for scalable oversight of advanced AI systems, yet no formal framework relates them or characterizes when debate offers an advantage.
  • When models share identical training corpora, debate reduces to an RLAIF-like setting in which a single-agent method recovers the same optimum.
Open paper
TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (General).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback
Red Team General
  • Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses.
Open paper
Eval4Sim: An Evaluation Framework for Persona Simulation

Eliseo Bao, Anxo Perez, Xi Wang, Javier Parapar · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (General).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback
Llm As Judge Simulation Env General
  • Large Language Model (LLM) personas with explicit specifications of attributes, background, and behavioural tendencies are increasingly used to simulate human conversations for tasks such as user modeling, social reasoning, and behavioural…
  • Ensuring that persona-grounded simulations faithfully reflect human conversational behaviour is therefore critical.
Open paper
Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale

Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (General).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback
Pairwise Preference General
  • The rapid proliferation of Claude agent skills has raised the central question of how to effectively leverage, manage, and scale the agent skill ecosystem.
  • In this paper, we propose AgentSkillOS, the first principled framework for skill selection, orchestration, and ecosystem-level management.
Open paper
