Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 83
Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving

Jiangxin Sun, Feng Xue, Teng Long, Chang Liu, Jian-Fang Hu, Wei-Shi Zheng · Feb 26, 2026

Citations: 0
Demonstrations General
  • Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation.
  • Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill risk-avoidance capabilities from the well-trained world model into a generative action proposal network without…
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang · Feb 26, 2026

Citations: 0
Automatic Metrics Multi Agent Math Coding
  • We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining.
  • Empirical results on extensive math benchmarks show that AgentDropoutV2 significantly boosts the MAS's task performance, achieving an average accuracy gain of 6.3 percentage points on math benchmarks.
Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng, Ismail Elezi · Feb 26, 2026

Citations: 0
Automatic Metrics Long Horizon Math Coding
  • This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
  • Across math reasoning benchmarks, we find that step-level recombination is most beneficial on harder problems, and ablations highlight the importance of the final AR solver in converting stitched but imperfect rationales into accurate…
Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks

Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng, Bo An · Feb 26, 2026

Citations: 0
Simulation Env Long Horizon Coding
  • Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks.
  • To address the issue, in this paper, we propose Hierarchy-of-Groups Policy Optimization (HGPO) for long-horizon agentic tasks.
Critique Edit Coding
  • NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code.
  • All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol.
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models

Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng, Kyle Lam · Feb 25, 2026

Citations: 0
Expert Verification Automatic Metrics Medicine Coding
  • Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
  • We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to 7 distinct visual clinical evidence (CE) types per case.
FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026

Citations: 0
Demonstrations General
  • In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video

Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao, Yibing Fu · Feb 25, 2026

Citations: 0
Expert Verification Automatic Metrics Medicine Coding
  • Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
  • We introduce ResGo, a benchmark of laparoscopic frames annotated with Go Zone bounding boxes and clinician-authored rationales covering phase, exposure quality reasoning, next action and risk reminder.
Self-Correcting VLA: Online Action Refinement via Sparse World Imagination

Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li, Guoli Yang · Feb 25, 2026

Citations: 0
Simulation Env Long Horizon Coding
  • Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.
  • Evaluations on challenging robot manipulation tasks from simulation benchmarks and real-world settings demonstrate that SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher s…
Structurally Aligned Subtask-Level Memory for Software Engineering Agents

Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026

Citations: 0
Automatic Metrics Long Horizon Coding
  • Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
  • Recent work has further explored augmenting these agents with memory mechanisms to support long-horizon reasoning.
A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives

Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung, Nikolay Koldunov · Feb 24, 2026

Citations: 0
Automatic Metrics Long Horizon Coding
  • Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.
  • Unlike standard Large Language Model (LLM) wrappers, our architecture implements a centralized Supervisor-Worker topology with strict data-type-aware routing, sandboxed deterministic code execution, and self-correction via execution feedbac…
A Benchmark for Deep Information Synthesis

Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov, Lena Sophia Bolliger · Feb 24, 2026

Citations: 0
Automatic Metrics Tool Use Coding
  • To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights.
  • When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97 and 17.5 on the LLM-judge metric, underscoring the difficulty of the benchmark.
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery

David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026

Citations: 0
Expert Verification Automatic Metrics Multi Agent Coding
  • Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
  • The code, datasets, and evaluation protocols for SparkMe are available as open-source at https://github.com/SALT-NLP/SparkMe.
Citations: 0
Demonstrations Coding
  • Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts.
  • Drawing inspiration from the theory of human cognitive processes, where inner speech guides action selection before execution, we propose MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that uses language as an…
