A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using a single consumer smartphone.
To address the challenge, in this work, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it…
Experiments on KernelBench show that StitchCUDA achieves a nearly 100% success rate on end-to-end GPU programming tasks, with a 1.72x speedup over the multi-agent baseline and 2.73x over the RL model baselines.
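As a rough illustration of the division of labour such a pipeline implies, here is a minimal planner/coder loop in Python. The `call_llm` helper, all prompts, and the reviewer-style third agent are assumptions for the sketch, not StitchCUDA's actual design (the third agent's role is elided in the snippet above).

```python
# Hypothetical sketch of a planner/coder multi-agent loop; call_llm() is a
# placeholder for a real model API, and the "reviewer" agent is a stand-in
# for the framework's unnamed third agent.
from dataclasses import dataclass

@dataclass
class AgentResult:
    role: str
    output: str

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., via an API client)."""
    return f"[{role} response to: {prompt[:40]}...]"

def generate_gpu_program(task: str, max_rounds: int = 3) -> list[AgentResult]:
    trace = []
    # The planner drafts the whole-system design once up front.
    plan = call_llm("planner", f"Design a GPU kernel pipeline for: {task}")
    trace.append(AgentResult("planner", plan))
    code = call_llm("coder", f"Implement this plan as CUDA code:\n{plan}")
    trace.append(AgentResult("coder", code))
    for _ in range(max_rounds):
        # A reviewer-style agent critiques; the coder revises until approval.
        review = call_llm("reviewer", f"Check correctness and perf of:\n{code}")
        trace.append(AgentResult("reviewer", review))
        if "APPROVED" in review:
            break
        code = call_llm("coder", f"Revise per this review:\n{review}\n{code}")
        trace.append(AgentResult("coder", code))
    return trace
```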
Effective hiring is integral to the success of an organisation, but finding the most suitable candidates is very challenging because expert evaluation (e.g., interviews conducted by a technical manager) is expensive to deploy at scale.
Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.
However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows.
We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts.
Pairwise Preference · Rubric Rating · LLM-as-Judge · Simulation Env · Long Horizon · General
Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations.
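One common way to align a judge with human annotations is to fit a monotone map from raw judge scores to the human scale; here is a minimal sketch of that step, assuming paired (judge score, human score) data on a shared 1-5 scale. The data and the choice of isotonic regression are illustrative, not the paper's actual pipeline.

```python
# A minimal calibration sketch: fit a monotone map from raw LLM-judge scores
# to human-annotated scores, then apply it to new judge outputs.
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical paired ratings on a shared 1-5 scale.
judge_scores = np.array([1.2, 2.1, 2.8, 3.5, 4.0, 4.6])
human_scores = np.array([1.0, 1.8, 3.0, 3.4, 4.2, 4.8])

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(judge_scores, human_scores)

new_judge_scores = np.array([2.5, 4.3])
print(calibrator.predict(new_judge_scores))  # human-aligned estimates
```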
Rubric Rating · Expert Verification · LLM-as-Judge · Automatic Metrics · Medicine
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety.
We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
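A per-dimension judge of this kind typically prompts the model for structured scores and parses them defensively. The sketch below uses the three dimensions named above, but the prompt wording, JSON schema, and 1-5 scale are assumptions, not the paper's actual framework.

```python
# Sketch of a multi-dimension LLM-as-judge prompt plus a defensive parser;
# prompt text and schema are illustrative guesses.
import json

DIMENSIONS = ["clinical_completeness", "factual_accuracy", "web_search_integration"]

def build_judge_prompt(question: str, answer: str) -> str:
    return (
        "Rate the answer on each dimension from 1 (poor) to 5 (excellent).\n"
        f"Dimensions: {', '.join(DIMENSIONS)}\n"
        f"Question: {question}\nAnswer: {answer}\n"
        'Reply with JSON only, e.g. {"clinical_completeness": 4, ...}'
    )

def parse_judge_reply(reply: str) -> dict:
    scores = json.loads(reply)
    # Keep only the expected keys and clamp everything to the 1-5 scale.
    return {d: min(5, max(1, int(scores[d]))) for d in DIMENSIONS}

reply = '{"clinical_completeness": 4, "factual_accuracy": 5, "web_search_integration": 2}'
print(parse_judge_reply(reply))
```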
We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging…
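For readers unfamiliar with BIO tagging in a document-splitting context: each unit (here assumed to be a page) is tagged Begin, Inside, or Outside, and contiguous tags decode into document spans. A minimal decoder, with the page granularity as an assumption:

```python
# Decode page-level BIO tags into (start, end) document spans; tag names and
# per-page granularity are assumptions, not DocSplit's actual scheme.
def bio_to_spans(tags: list[str]) -> list[tuple[int, int]]:
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                    # a new document begins here
            if start is not None:
                spans.append((start, i - 1))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i - 1))  # close the open document
            start = None
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans

# A 7-page batch: two documents separated by a non-document page.
print(bio_to_spans(["B", "I", "I", "O", "B", "I", "I"]))  # [(0, 2), (4, 6)]
```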
Notably, our method achieves fairness gains comparable to using additional real human data, offering a cost-effective solution for ensuring equitable scoring in large-scale assessments.
We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it.
The agreement ceiling in medical AI evaluation is thus largely structural, but the reducible/irreducible dissociation suggests that closing information gaps in evaluation scenarios could lower disagreement where inherent clinical ambiguity…
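The core computation behind such a decomposition is a one-way split of rating variance into a between-scenario component and a within-scenario component (rater disagreement). A minimal sketch with illustrative data, not HealthBench:

```python
# One-way variance decomposition of rater scores; scores[s][r] is the rating
# of scenario s by physician r (hypothetical values).
import numpy as np

scores = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
])

grand_mean = scores.mean()
scenario_means = scores.mean(axis=1, keepdims=True)

# Total variance = between-scenario part + within-scenario (disagreement) part.
between = ((scenario_means - grand_mean) ** 2).mean()
within = ((scores - scenario_means) ** 2).mean()
print(f"between-scenario: {between:.3f}, within-scenario: {within:.3f}")
```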
In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.
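For concreteness, here is what assembling a few-shot prompt with chain-of-thought exemplars looks like; this text-only sketch stands in for the interleaved image-text exemplars a multimodal benchmark would use, and the exemplar contents are illustrative.

```python
# Assemble a few-shot ICL prompt whose exemplars include worked reasoning
# (CoT); exemplars here are toy text-only stand-ins.
EXEMPLARS = [
    {"q": "2 + 3 * 4 = ?", "cot": "3 * 4 = 12; 2 + 12 = 14.", "a": "14"},
    {"q": "Is 51 prime?", "cot": "51 = 3 * 17, so it has a divisor.", "a": "No"},
]

def few_shot_cot_prompt(question: str) -> str:
    shots = "\n\n".join(
        f"Q: {e['q']}\nReasoning: {e['cot']}\nA: {e['a']}" for e in EXEMPLARS
    )
    return f"{shots}\n\nQ: {question}\nReasoning:"

print(few_shot_cot_prompt("What is 7 * 8 - 6?"))
```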
Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by…
Here we propose a web-based system implementing ArgLLM-empowered agents for binary tasks.
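Underneath such agents sits a quantitative bipolar argumentation framework: a claim's strength is computed from its base score plus the aggregated strengths of attackers and supporters. Below is a minimal sketch using DF-QuAD-style aggregation, one common gradual semantics in this line of work; whether this exact semantics is what the system uses is an assumption.

```python
# Evaluate argument strengths in a tree-shaped bipolar argumentation
# framework with DF-QuAD-style aggregation (an assumption for this sketch).
from dataclasses import dataclass, field

@dataclass
class Arg:
    base_score: float                     # intrinsic strength in [0, 1]
    attackers: list["Arg"] = field(default_factory=list)
    supporters: list["Arg"] = field(default_factory=list)

def combine(strengths: list[float]) -> float:
    """Probabilistic-sum aggregation of child strengths."""
    agg = 0.0
    for s in strengths:
        agg = agg + s - agg * s
    return agg

def strength(a: Arg) -> float:
    va = combine([strength(x) for x in a.attackers])
    vs = combine([strength(x) for x in a.supporters])
    b = a.base_score
    if vs >= va:                          # net support pushes toward 1
        return b + (1 - b) * (vs - va)
    return b - b * (va - vs)              # net attack pushes toward 0

claim = Arg(0.5, attackers=[Arg(0.6)], supporters=[Arg(0.8), Arg(0.3)])
print(f"accept claim: {strength(claim) > 0.5}")  # binary decision
```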
We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA.
We propose nine rubrics for answer quality evaluation, and show that simple linear models based on these features perform comparably to state-of-the-art LLM evaluators.
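A simple linear preference model of this kind is typically a logistic regression on rubric-feature differences between the two answers (a Bradley-Terry-style setup). The sketch below uses synthetic data and does not reproduce the paper's nine rubric definitions.

```python
# Fit a linear pairwise-preference model on rubric-feature differences;
# features and labels are synthetic, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pairs, n_rubrics = 200, 9

# Hypothetical rubric scores for answers A and B in each annotated pair.
feats_a = rng.uniform(0, 5, size=(n_pairs, n_rubrics))
feats_b = rng.uniform(0, 5, size=(n_pairs, n_rubrics))
diff = feats_a - feats_b

# Synthetic labels: A preferred when its weighted rubric total is higher.
true_w = rng.uniform(0.2, 1.0, n_rubrics)
labels = (diff @ true_w + rng.normal(0, 0.5, n_pairs) > 0).astype(int)

model = LogisticRegression().fit(diff, labels)
print(f"train accuracy: {model.score(diff, labels):.2f}")
```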
Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation.
Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill risk-avoidance capabilities from the well-trained world model into a generative action proposal network without…
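The selection step described above reduces to: roll each candidate action through the world model, filter by an explicit risk score, and pick the best survivor. A toy sketch in that spirit; the dynamics, risk, and reward functions below are stand-ins, not RaWMPC's learned components.

```python
# Risk-aware candidate selection with a (toy) world model.
import numpy as np

def world_model(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Toy dynamics: predicted next state under a candidate action."""
    return state + 0.1 * action

def risk(state: np.ndarray) -> float:
    """Toy risk: distance past a safety boundary at |x| = 1."""
    return float(np.maximum(np.abs(state) - 1.0, 0.0).sum())

def reward(state: np.ndarray, goal: np.ndarray) -> float:
    return -float(np.linalg.norm(state - goal))

def select_action(state, goal, candidates, risk_limit=0.0):
    # Predict each candidate's consequence, filter by explicit risk, then
    # pick the best remaining action by predicted reward.
    scored = []
    for a in candidates:
        nxt = world_model(state, a)
        if risk(nxt) <= risk_limit:
            scored.append((reward(nxt, goal), a))
    if not scored:                      # fall back to the least-risky action
        return min(candidates, key=lambda a: risk(world_model(state, a)))
    return max(scored, key=lambda t: t[0])[1]

state, goal = np.array([0.9]), np.array([2.0])
candidates = [np.array([v]) for v in (-1.0, 0.0, 0.5, 1.5)]
print(select_action(state, goal, candidates))  # picks 0.5; 1.5 is too risky
```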
Extensive experiments on five KGQA benchmark datasets demonstrate that our method achieves, to the best of our knowledge, state-of-the-art performance, outperforming not only open-source but also closed-source LLMs.