Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 22 Search mode: keyword RSS
IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation

Md Mofijul Islam, Md Sirajus Salekin, Joe King, Priyashree Roy, Vamsi Thilak Gudi, Spencer Romo · Feb 26, 2026

Citations: 0
Demonstrations Automatic Metrics Coding
  • We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging…
Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving

Jiangxin Sun, Feng Xue, Teng Long, Chang Liu, Jian-Fang Hu, Wei-Shi Zheng · Feb 26, 2026

Citations: 0
Demonstrations General
  • Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation.
  • Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill riskavoidance capabilities from the well-trained world model into a generative action proposal network without…
FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026

Citations: 0
Demonstrations General
  • In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.
Citations: 0
Demonstrations Coding
  • Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts.
  • Drawing inspiration from the theory of human cognitive processes, where inner speech guides action selection before execution, we propose MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that uses language as an…
Demonstrations Coding
  • Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences.
  • Positioned as a "methodological experiment," this study proposes an AI Agent-based collaborative research workflow (Agentic Workflow) for humanities and social science research.
Demonstrations General
  • This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections.
  • Perspectives implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement capabilities.
Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework

Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang, Li Qing · Feb 17, 2026

Citations: 0
Demonstrations Automatic Metrics General
  • Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.
  • We first define the components and evaluation metrics for TOFs, then formalize a cost-efficient flowchart construction algorithm to abstract procedural knowledge from service dialogues.
AITutor-EvalKit: Exploring the Capabilities of AI Tutors

Numaan Naeem, Kaushal Kumar Maurya, Kseniia Petukhova, Ekaterina Kochmar · Dec 3, 2025

Citations: 0
Demonstrations General
  • We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors, provides software for demonstration and evaluation, as well as model inspection and data visualization.
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation

Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang, Huang Huang · Oct 21, 2025

Citations: 0
Demonstrations Simulation Env Long Horizon General
  • Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
  • This challenge intensifies for multi-step bimanual mobile manipulation, where humans must teleoperate both the mobile base and two high-DoF arms.
SPACeR: Self-Play Anchoring with Centralized Reference Models

Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka, Yihan Hu · Oct 20, 2025

Citations: 0
Demonstrations Simulation Env Multi Agent General
  • Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.
  • Achieving this requires sim agent policies that are human-like, fast, and scalable in multi-agent settings.
On Discovering Algorithms for Adversarial Imitation Learning

Shashank Reddy Chirra, Jayden Teoh, Praveen Paruchuri, Pradeep Varakantham · Oct 1, 2025

Citations: 0
Demonstrations Simulation Env Coding
  • RA functions in AIL are typically derived from divergence minimization objectives, relying heavily on human design and ingenuity.
  • Remarkably, DAIL generalises across unseen environments and policy optimization algorithms, outperforming the current state-of-the-art of \emph{human-designed} baselines.
Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs

Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktaschel, Jakob Foerster · Jun 23, 2025

Citations: 0
Demonstrations Coding
  • Though execution of instructions in training data remains less reliable than when instructions are given in-context, our results demonstrate that procedural knowledge can be noisily `programmed' into LLMs through PBB, with important…

Protocol Hubs

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.