Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 63

Featured Papers

Popular high-signal papers with direct links to full protocol pages.


StitchCUDA: An Automated Multi-Agent End-to-End GPU Programming Framework with Rubric-based Agentic Reinforcement Learning

Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, Caiwen Ding · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Rubric Rating).

Score: 55% · High protocol signal · Freshness: Hot · Status: Ready
Rubric Rating Automatic Metrics Multi Agent Coding
  • To address this challenge, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation with three specialized agents: a Planner to orchestrate whole-system design, a Coder dedicated to implementing it…
  • Experiments on KernelBench show that StitchCUDA achieves a nearly 100% success rate on end-to-end GPU programming tasks, with a 1.72x speedup over the multi-agent baseline and 2.73x over the RL model baselines (an illustrative planner/coder sketch follows this entry).
Open paper
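
The planner/coder split is the load-bearing idea in this entry. As a rough illustration only, a handoff loop might look like the sketch below; the `llm` callable, the prompts, and the bounded review loop are assumptions, not StitchCUDA's actual agents or its rubric-based RL training.

```python
# Illustrative planner -> coder handoff loop. The `llm` callable, the
# prompts, and the review step are assumptions for illustration, not
# StitchCUDA's actual framework.
def generate_kernel(llm, task: str, max_rounds: int = 3) -> str:
    plan = llm(f"Plan, step by step, a CUDA implementation for: {task}")
    code = llm(f"Implement this plan as a CUDA kernel:\n{plan}")
    for _ in range(max_rounds):  # bounded plan-vs-code repair loop
        review = llm("Review the kernel against the plan. Reply OK or "
                     f"list fixes.\nPlan:\n{plan}\nCode:\n{code}")
        if review.strip().startswith("OK"):
            break
        code = llm(f"Revise the kernel.\nFixes:\n{review}\nCode:\n{code}")
    return code
```
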
Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation

Harry Stuart, Masahiro Kaneko, Timothy Baldwin · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Rubric Rating).

Score: 55% · High protocol signal · Freshness: Hot · Status: Ready
Rubric Rating Automatic Metrics Coding
  • Effective hiring is integral to the success of an organisation, but finding the most suitable candidates is very challenging because expert evaluation (e.g., interviews conducted by a technical manager) is expensive to deploy at scale.
Open paper
RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning

Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li, Zhen Qin · Feb 25, 2026

Citations: 0

Match reason: Matches selected tags (Rubric Rating).

Score: 55% · Moderate protocol signal · Freshness: Hot · Status: Ready
Rubric Rating Automatic Metrics General
  • Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.
Open paper
TimeWarp: Evaluating Web Agents by Revisiting the Past

Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026

Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 52% · Moderate protocol signal · Freshness: Hot · Status: Ready
Demonstrations Web Browsing General
  • The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?
  • We introduce TimeWarp, a benchmark that emulates the evolving web using containerized environments that vary in UI, design, and layout.
Open paper
ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels

Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Rubric Rating).

Score: 52% · Moderate protocol signal · Freshness: Hot · Status: Ready
Rubric Rating Llm As Judge Medicine
  • However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows.
  • We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts.
Open paper
Optimizing In-Context Demonstrations for LLM-based Automated Grading

Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Kevin Haudek, Joseph Krajcik · Feb 28, 2026

Citations: 0

Match reason: Matches selected tags (Rubric Rating, Demonstrations).

Score: 58% · Sparse protocol signal · Freshness: Hot · Status: Fallback
Rubric Rating Demonstrations General
  • GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.
Open paper
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright, Marcus Yearwood · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Rubric Rating).

Score: 55% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Pairwise Preference Rubric Rating Llm As Judge Simulation Env Long Horizon General
  • Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
  • We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations (a sketch of the calibration step follows this entry).
Open paper
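
The calibration step is the part most readers will want to reproduce. A minimal sketch, assuming per-dimension 1-5 judge scores and an isotonic fit against human annotations; the rubric dimensions and the `llm` callable here are invented, not the paper's.

```python
# Hypothetical sketch of a calibrated LLM-as-judge scorer. The rubric
# dimensions, prompt, and isotonic calibration are assumptions, not the
# paper's actual pipeline.
from sklearn.isotonic import IsotonicRegression

RUBRIC = ["task_completion", "factual_grounding", "tone", "efficiency"]

def judge_scores(llm, transcript: str) -> dict:
    """Ask the judge model for a 1-5 score on each rubric dimension."""
    scores = {}
    for dim in RUBRIC:
        prompt = (f"Rate the shopping-assistant transcript on '{dim}' "
                  f"from 1 (poor) to 5 (excellent). Reply with a single "
                  f"number.\n\n{transcript}")
        scores[dim] = float(llm(prompt).strip())
    return scores

def fit_calibrator(raw_judge_scores, human_scores):
    """Monotone map from raw judge scores onto human annotations."""
    return IsotonicRegression(out_of_bounds="clip").fit(
        raw_judge_scores, human_scores)
```

Calibrated per-dimension scores can then be averaged, or weighted, into a single quality score.
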
PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology

Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva, Jesse J Hubbard · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Rubric Rating).

Score: 55% · High protocol signal · Freshness: Hot · Status: Fallback
Rubric Rating Expert Verification Llm As Judge Automatic Metrics Medicine
  • Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety.
  • We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
Open paper
Confusion-Aware Rubric Optimization for LLM-based Automated Grading

Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Joseph Krajcik, Namsoo Shin · Feb 28, 2026

Citations: 0

Match reason: Matches selected tags (Rubric Rating).

Score: 55% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Rubric Rating Automatic Metrics Medicine
  • Empirical evaluations on teacher education and STEM datasets demonstrate that CARO significantly outperforms existing SOTA methods.
Open paper
IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation

Md Mofijul Islam, Md Sirajus Salekin, Joe King, Priyashree Roy, Vamsi Thilak Gudi, Spencer Romo · Feb 26, 2026

Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 55% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Demonstrations Automatic Metrics Coding
  • We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging… A sketch of BIO-span decoding follows this entry.
Open paper
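
The BIO-tagging detail hints at how DocSplit turns per-page labels into document boundaries. A minimal sketch of just the decoding step, assuming page-level B/I/O labels; the classifier that produces the labels is not shown, and this is an illustration rather than the paper's code.

```python
# Hypothetical sketch of BIO decoding for document splitting: per-page
# labels B (begin), I (inside), O (outside) are grouped into document
# spans, returned as (start_page, end_page) with an exclusive end.
def decode_bio(labels):
    spans, start = [], None
    for i, tag in enumerate(labels):
        if tag == "B":                 # a new document starts here
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))   # close the open document
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans

assert decode_bio(["B", "I", "B", "O", "B"]) == [(0, 2), (2, 3), (4, 5)]
```

End-exclusive spans make downstream page slicing (`pages[start:end]`) direct.
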
BRIDGE the Gap: Mitigating Bias Amplification in Automated Scoring of English Language Learners via Inter-group Data Augmentation

Yun Wang, Xuansheng Wu, Jingyuan Huang, Lei Liu, Xiaoming Zhai, Ninghao Liu · Feb 27, 2026

Citations: 0

Match reason: Matches selected tags (Rubric Rating).

Score: 52% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Rubric Rating General
  • Notably, our method achieves fairness gains comparable to using additional real human data, offering a cost-effective solution for ensuring equitable scoring in large-scale assessments.
Open paper
Decomposing Physician Disagreement in HealthBench

Satya Borgohain, Roy Mariathas · Feb 26, 2026

Citations: 0

Match reason: Matches selected tags (Rubric Rating).

Score: 52% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Rubric Rating Medicine
  • We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it (an illustrative variance split follows this entry).
  • The agreement ceiling in medical AI evaluation is thus largely structural, but the reducible/irreducible dissociation suggests that closing information gaps in evaluation scenarios could lower disagreement where inherent clinical ambiguity…
Open paper
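
One way to read "where variance resides" is an ANOVA-style split of rating variance into a between-scenario share and a within-scenario residual. The toy decomposition below is an assumption for exposition, not the paper's actual method or its reducible/irreducible dissociation.

```python
# Illustrative ANOVA-style split of rater variance; an assumption for
# exposition, not the paper's actual decomposition of disagreement.
import numpy as np

def decompose_disagreement(scores):
    """scores: scenario id -> list of physician ratings of one response.
    Returns (between-scenario share, within-scenario share) of variance."""
    ratings = [np.asarray(v, dtype=float) for v in scores.values()]
    grand = np.concatenate(ratings).mean()
    between = sum(len(v) * (v.mean() - grand) ** 2 for v in ratings)
    within = sum(((v - v.mean()) ** 2).sum() for v in ratings)
    total = between + within
    return between / total, within / total

shares = decompose_disagreement({"case_1": [3, 4, 4], "case_2": [1, 5, 3]})
```
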
AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman, Sara Price · Feb 26, 2026

Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 52% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Demonstrations General
  • We introduce AuditBench, an alignment auditing benchmark.
  • To demonstrate AuditBench's utility, we develop an investigator agent that autonomously employs a configurable set of auditing tools.
Open paper
FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026

Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 52% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Demonstrations General
  • In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.
Open paper
Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 48% · Sparse protocol signal · Freshness: Hot · Status: Fallback
Demonstrations General
  • Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by…
  • Here we propose a web-based system implementing ArgLLM-empowered agents for binary tasks.
Open paper
LFQA-HP-1M: A Large-Scale Human Preference Dataset for Long-Form Question Answering

Rafid Ishrak Jahan, Fahmid Shahriar Iqbal, Sagnik Ray Choudhury · Feb 27, 2026

Citations: 0

Match reason: Matches selected tags (Rubric Rating).

Score: 48% · Sparse protocol signal · Freshness: Hot · Status: Fallback
Pairwise Preference Rubric Rating General
  • We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA.
  • We propose nine rubrics for answer quality evaluation and show that simple linear models based on these features perform comparably to state-of-the-art LLM evaluators (a minimal sketch of such a model follows this entry).
Open paper
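
The claim that simple linear models rival LLM evaluators is easy to picture: a logistic regression over per-rubric score differences for each (A, B) pair. A minimal sketch with synthetic stand-ins; the nine feature names and the data here are invented, not the LFQA-HP-1M rubrics.

```python
# Minimal sketch of a linear pairwise-preference model; features and
# labels below are synthetic stand-ins, not the dataset's real rubrics.
import numpy as np
from sklearn.linear_model import LogisticRegression

N_RUBRICS = 9  # the paper proposes nine rubrics for answer quality

# Represent each pair (A, B) by its per-rubric score differences.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, N_RUBRICS))                 # stand-in features
y = (X @ rng.normal(size=N_RUBRICS) > 0).astype(int)   # stand-in labels

model = LogisticRegression().fit(X, y)
print("P(A preferred) for first pair:", model.predict_proba(X[:1])[0, 1])
```
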
Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving

Jiangxin Sun, Feng Xue, Teng Long, Chang Liu, Jian-Fang Hu, Wei-Shi Zheng · Feb 26, 2026

Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 48% · Sparse protocol signal · Freshness: Hot · Status: Fallback
Demonstrations General
  • Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation (a minimal selection loop follows this entry).
  • Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill risk-avoidance capabilities from the well-trained world model into a generative action proposal network without…
Open paper
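
The core loop is candidate rollout plus explicit risk scoring. A minimal sketch, with the world model and risk function left as assumed interfaces; this is an illustration, not the paper's RaWMPC implementation.

```python
# Minimal sketch of risk-aware action selection; `world_model` and
# `risk_fn` are assumed interfaces, not the paper's implementation.
def select_action(world_model, risk_fn, state, candidates, horizon=8):
    """Roll each candidate action sequence through the world model and
    return the first action of the lowest-risk predicted trajectory."""
    best_action, best_risk = None, float("inf")
    for actions in candidates:              # each: a sequence of actions
        s, risk = state, 0.0
        for a in actions[:horizon]:
            s = world_model(s, a)           # predicted next state
            risk += risk_fn(s)              # e.g. collision likelihood
        if risk < best_risk:
            best_action, best_risk = actions[0], risk
    return best_action
```
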
Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 48% · Sparse protocol signal · Freshness: Hot · Status: Fallback
Demonstrations General
  • Extensive experiments on five KGQA benchmark datasets demonstrate that our method achieves, to the best of our knowledge, state-of-the-art performance, outperforming not only open-source but even closed-source LLMs.
Open paper
