
Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 239

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to explore deeper content clusters.

Start Here By Objective

Pick your immediate research objective and jump straight to high-signal pages instead of generic search.

When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger

Amirabbas Afzali, Myeongho Jeon, Maria Brbic · Mar 5, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics General
  • Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization…
  • Notably, the model aligned by CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO.
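The excerpt above describes re-weighting preference pairs by a weak LLM's confidence. As an illustration only (the paper's exact weighting scheme is not given in this excerpt; the function names `dpo_loss` and `cw_po_batch_loss` and the normalized weighting are our assumptions), a minimal sketch of a confidence-weighted DPO-style objective:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, beta=0.1):
    # Standard DPO loss for one preference pair:
    # -log(sigmoid(beta * (logp_chosen - logp_rejected)))
    margin = beta * (logp_chosen - logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def cw_po_batch_loss(pairs, confidences, beta=0.1):
    # Hypothetical re-weighting: scale each pair's loss by the
    # weak LLM's confidence, normalized to sum to 1 over the batch.
    total_conf = sum(confidences)
    return sum(
        (c / total_conf) * dpo_loss(lc, lr, beta)
        for (lc, lr), c in zip(pairs, confidences)
    )
```

Putting more weight on pairs the weak model is confident about shifts the batch loss toward those pairs; whether the real CW-PO normalizes per batch or per sample is not stated here.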
Open paper
LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services

Jinwen Chen, Shuai Gong, Shiwen Zhang, Zheng Zhang, Yachao Zhao, Lingxiang Wang · Mar 5, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics General
  • While LLMs offer strong semantic generalization, deploying them in local-life services introduces three key challenges: lack of geographic grounding, exposure bias in preference optimization, and online inference latency.
  • Extensive offline evaluations and large-scale online A/B testing demonstrate that LocalSUG improves click-through rate (CTR) by +0.35% and reduces the low/no-result rate by 2.56%, validating its effectiveness in real-world deployment.
Open paper

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Expert Verification Automatic Metrics Medicine Coding
  • Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content.
  • Experiments on medical dialogue benchmarks show that PrivMedChat at ε = 7 achieves the highest ROUGE-L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall…
Open paper
Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification

Yichi Zhang, Nabeel Seedat, Yinpeng Dong, Peng Cui, Jun Zhu, Mihaela van der Schaar · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% High protocol signal Freshness: Hot Status: Ready
Expert Verification Automatic Metrics Long Horizon Medicine
  • As LLM-powered agents have been used for high-stakes decision-making, such as clinical diagnosis, it becomes critical to develop reliable verification of their decisions to facilitate trustworthy deployment.
  • We empirically validate GLEAN with agentic clinical diagnosis across three diseases from the MIMIC-IV dataset, surpassing the best baseline by 12% in AUROC and 50% in Brier score reduction, which confirms the effectiveness in both…
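The GLEAN excerpt reports gains in AUROC and Brier score. For readers triaging by metric, the Brier score is simply the mean squared error between predicted probabilities and binary outcomes (lower is better); a minimal reference implementation, with a function name of our choosing:

```python
def brier_score(probs, outcomes):
    # Mean squared error between predicted probabilities (0..1)
    # and binary outcomes (0 or 1); 0 is a perfect score.
    assert len(probs) == len(outcomes)
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
```

A "50% Brier score reduction" therefore means the verifier's probability estimates are, on average, substantially better calibrated than the baseline's.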
Open paper
StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, Caiwen Ding · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% High protocol signal Freshness: Hot Status: Ready
Rubric Rating Automatic Metrics Multi Agent Coding
  • To address the challenge, in this work, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it…
  • Experiments on KernelBench show that StitchCUDA achieves nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup over the multi-agent baseline and 2.73x than the RL model baselines.
Open paper
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% High protocol signal Freshness: Hot Status: Ready
Red Team Automatic Metrics Web Browsing General
  • Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs.
  • We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic…
Open paper
Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation

Harry Stuart, Masahiro Kaneko, Timothy Baldwin · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% High protocol signal Freshness: Hot Status: Ready
Rubric Rating Automatic Metrics Coding
  • Effective hiring is integral to the success of an organisation, but it is very challenging to find the most suitable candidates because expert evaluation (e.g. interviews conducted by a technical manager) is expensive to deploy at scale.
Open paper
Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics Math Coding
  • While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in Direct…
Open paper
TimeWarp: Evaluating Web Agents by Revisiting the Past

Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026

Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 52% Moderate protocol signal Freshness: Hot Status: Ready
Demonstrations Web Browsing General
  • The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?
  • We introduce TimeWarp, a benchmark that emulates the evolving web using containerized environments that vary in UI, design, and layout.
Open paper
ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts

Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat · Mar 5, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback
Llm As Judge Automatic Metrics General
  • Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators.
  • Finally, we introduce the ThaiSafetyBench leaderboard to provide continuously updated safety evaluations and encourage community participation.
Open paper
Replaying pre-training data improves fine-tuning

Suhas Kotha, Percy Liang · Mar 5, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Web Browsing Math
  • We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by 4.5% and Basque question-answering accuracy by 2%.
Open paper
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu · Mar 4, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% High protocol signal Freshness: Hot Status: Fallback
Pairwise Preference Automatic Metrics Web Browsing Coding
  • We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous…
  • We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level…
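The excerpt above reports Pearson r = 0.72 between self-testing during generation and benchmark performance. As a quick reference for interpreting that figure, the Pearson correlation coefficient is covariance normalized by the product of standard deviations; a minimal sketch (the function name is illustrative):

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation: covariance of xs and ys divided by
    # the product of their standard deviations; ranges in [-1, 1].
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

An r of 0.72 indicates a strong, but not deterministic, linear association between the two quantities.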
Open paper
AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning

Nikolas Karafyllis, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou · Mar 4, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% High protocol signal Freshness: Hot Status: Fallback
Pairwise Preference Automatic Metrics General
  • We present a winning three-stage system for SemEval 2026 Task 12: Abductive Event Reasoning that combines graph-based retrieval, LLM-driven abductive reasoning with prompt design optimized through reflective prompt evolution, and post-hoc…
  • Cross-model error analysis across 14 models (7 families) reveals three shared inductive biases: causal chain incompleteness, proximate cause preference, and salience bias, whose cross-family convergence (51% cause-count reduction)…
Open paper
V_1: Unifying Generation and Self-Verification for Parallel Reasoners

Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan, Xiaoxia Wu · Mar 4, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% High protocol signal Freshness: Hot Status: Fallback
Pairwise Preference Automatic Metrics Math Coding
  • On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, V_1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being…
Open paper
Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Multi Agent Math Coding
  • As a proof of concept, we present GenDB, an LLM-powered agentic system that generates instance-optimized and customized query execution code tailored to specific data, workloads, and hardware resources.
  • We implemented an early prototype of GenDB that uses Claude Code Agent as the underlying component in the multi-agent system, and we evaluate it on OLAP workloads.
Open paper
From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen, Yaowei Wang · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon Coding
  • While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive…
  • Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization.
Open paper
LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval

Jiajie Jin, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Yutao Zhu · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • Extensive experiments on both in-domain and out-of-domain reasoning-intensive benchmarks demonstrate that LaSER significantly outperforms state-of-the-art baselines.
Open paper
PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology

Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva, Jesse J Hubbard · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 55% High protocol signal Freshness: Hot Status: Fallback
Rubric Rating Expert Verification Llm As Judge Automatic Metrics Medicine
  • Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety.
  • We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
Open paper
Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility

Angana Borah, Zohaib Khan, Rada Mihalcea, Verónica Pérez-Rosas · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics).

Score: 52% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Simulation Env General
  • As Large Language Models (LLMs) are increasingly used to simulate human behaviors, we investigate whether they can simulate demographic misinformation susceptibility, treating beliefs as a primary driving factor.
  • We study prompt-based conditioning and post-training adaptation, and conduct a multi-fold evaluation using: (i) susceptibility accuracy and (ii) counterfactual demographic sensitivity.
Open paper
