
Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 31
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents

Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu, Liqun Liu · Feb 15, 2026

Citations: 0
Expert Verification Simulation Env Long Horizon Coding
  • While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.
  • To address this gap, we propose AD-Bench, a benchmark designed based on real-world business requirements of advertising and marketing platforms.
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li · Feb 12, 2026

Citations: 0
Automatic Metrics Tool Use Coding
  • To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.
  • To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA examples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global--regional…
Probing Graph Neural Network Activation Patterns Through Graph Topology

Floriano Tori, Lorenzo Bini, Marco Sorbi, Stéphane Marchand-Maillet, Vincent Ginis · Feb 24, 2026

Citations: 0
Pairwise Preference Automatic Metrics General
  • However, it remains unclear how the topology of a graph interacts with the learned preferences of GNNs.
  • Our findings on synthetic graphs and molecular benchmarks reveal that MAs do not preferentially concentrate on curvature extremes, despite their theoretical link to information flow.
APEX-Agents

Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein · Jan 20, 2026

Citations: 0
Rubric Rating Expert Verification Automatic Metrics Long Horizon Law
  • We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate…
  • We test eight agents for the leaderboard using Pass@1.
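Pass@1 is presumably the standard unbiased pass@k estimator from code-generation evaluation (the paper excerpt does not define it); with k = 1 it reduces to the fraction of successful runs. A minimal sketch, assuming that convention:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = total samples per task,
    c = number of correct samples, k = evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k draw contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 this is just the per-task success rate:
print(pass_at_k(1, 1, 1))   # 1.0
print(pass_at_k(10, 3, 1))  # ≈ 0.3
```

Averaging this quantity over all tasks gives the leaderboard score.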
Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence

ChengYou Li, XiaoDong Liu, XiangBao Meng, XinYu Zhao · Feb 24, 2026

Citations: 0
Simulation Env Multi Agent General
  • The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems. While current research primarily focuses on scaling context windows or optimizing prompt engineering…
Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition

Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang, Frank Ong · Apr 26, 2025

Citations: 0
Pairwise Preference Automatic Metrics Multi Agent General
  • These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.
  • We propose MATCHA, a multi-agent framework for CRS that assigns specialized agents to intent parsing, tool-augmented retrieval, multi-LLM ranking with reflection, explanation, and risk control, enabling finer personalization,…
Structurally Aligned Subtask-Level Memory for Software Engineering Agents

Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026

Citations: 0
Automatic Metrics Long Horizon Coding
  • Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
  • Recent work has further explored augmenting these agents with memory mechanisms to support long-horizon reasoning.
From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity

Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen, Haiyang Geng · Oct 29, 2025

Citations: 0
Automatic Metrics Multi Agent Medicine
  • To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation.
  • Our multi-agent framework transfers the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards.
MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation

Hsin-Ling Hsu, Cong-Tinh Dao, Luning Wang, Zitao Shuai, Thao Nguyen Minh Phan, Jun-En Ding · Mar 23, 2025

Citations: 0
Expert Verification Automatic Metrics Medicine
  • Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.
RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA

Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025

Citations: 0
Automatic Metrics Long Horizon General
  • A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); finally, the Head Agent composes…
  • Experiments on HotpotQA (text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1 gains over strong single-pass, multi-hop, and agentic RAG baselines with high efficiency.
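The EM/F1 figures cited are presumably the usual extractive-QA metrics: exact match after normalization, and token-overlap F1 between prediction and gold answer. A minimal sketch of those computations (normalization here is simplified to lowercasing and whitespace splitting; SQuAD-style scoring also strips articles and punctuation):

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    p_toks, g_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p_toks) & Counter(g_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))          # 1.0
print(token_f1("the cat sat", "cat sat"))     # 0.8
```

Per-example scores are typically maxed over multiple gold answers, then averaged over the dataset.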
Citations: 0
Simulation Env Long Horizon General
  • We propose GiG, a novel planning framework that structures embodied agents' memory using a Graph-in-Graph architecture.
  • Furthermore, we introduce a novel bounded lookahead module that leverages symbolic transition logic to enhance the agents' planning capabilities through the grounded action projection.