Chaoyue He, Xin Zhou, Xinjia Yu, Lei Zhang, Yan Zhang, Yi Wu · Feb 28, 2026
Researcher Tools
Human Feedback and Eval Paper Explorer
A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.
Filter by tag
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang · Feb 26, 2026
- We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining.
- Empirical results on extensive math benchmarks show that AgentDropoutV2 significantly boosts the MAS's task performance, achieving an average accuracy gain of 6.3 percentage points on math benchmarks.
Dimitrios P. Panagoulias, Evangelia-Aikaterini Tsichrintzi, Georgios Savvidis, Evridiki Tsoureli-Nikita · Feb 26, 2026
- Human-in-the-loop validation is essential in safety-critical clinical AI, yet the transition between initial model inference and expert correction is rarely analyzed as a structured signal.
- Evaluation on 21 dermatological cases (21 complete AI physician pairs) em- ployed a four-level concordance framework comprising exact primary match rate (PMR), semantic similarity-adjusted rate (AMR), cross-category alignment, and…
Shentong Mo, Xufang Luo, Dongsheng Li · Feb 26, 2026
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng, Ismail Elezi · Feb 26, 2026
- This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
- Across math reasoning benchmarks, we find that step-level recombination is most beneficial on harder problems, and ablations highlight the importance of the final AR solver in converting stitched but imperfect rationales into accurate…
Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng, Bo An · Feb 26, 2026
- Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks.
- To address the issue, in this paper, we propose Hierarchy-of-Groups Policy Optimization (HGPO) for long-horizon agentic tasks.
Hyunwoo Kim, Hanau Yi, Jaehee Bae, Yumin Kim · Feb 26, 2026
- NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code.
- All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol.
Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang · Feb 26, 2026
- As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety the quality of interaction patterns that unfold across conversations rather than the correctness…
- We introduce TherapyProbe, a design probe methodology that generates actionable design knowledge by systematically exploring chatbot conversation trajectories through adversarial multi-agent simulation.
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng · Feb 25, 2026
- Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
- Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion.
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt, Zijian Wang · Feb 25, 2026
- Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng, Kyle Lam · Feb 25, 2026
- Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
- We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to 7 distinct visual clinical evidence (CE) types per case.
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao, Yibing Fu · Feb 25, 2026
- Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
- We introduce ResGo, a benchmark of laparoscopic frames annotated with Go Zone bounding boxes and clinician-authored rationales covering phase, exposure quality reasoning, next action and risk reminder.
Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li, Guoli Yang · Feb 25, 2026
- Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.
- Evaluations on challenging robot manipulation tasks from simulation benchmarks and real-world settings demonstrate that SC-VLA achieve state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher s
Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026
- Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
- Recent work has further explored augmenting these agents with memory mechanisms to support long-horizon reasoning.
Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung, Nikolay Koldunov · Feb 24, 2026
- Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.
- Unlike standard Large Language Model (LLM) wrappers, our architecture implements a centralized Supervisor-Worker topology with strict data-type-aware routing, sandboxed deterministic code execution, and self-correction via execution feedbac
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov, Lena Sophia Bolliger · Feb 24, 2026
- To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights.
- When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97 and 17.5 on the LLM-judge metric, underscoring the difficulty of the benchmark.
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026
- Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
- The code, datasets, and evaluation protocols for SparkMe are available as open-source at https://github.com/SALT-NLP/SparkMe.
Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng, Wei Dong · Feb 24, 2026
- Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.
- However, this deepening trust introduces a novel attack surface: Agent-Mediated Deception (AMD), where compromised agents are weaponized against their human users.
Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, Harmanpreet Kaur · Feb 24, 2026
- Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in…
- We validated this schema through contextual inquiries with 10 additional scientists, which showed not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues.
Peter Hase, Christopher Potts · Feb 24, 2026
Protocol Hubs
Benchmark Hubs
Metric Hubs
- Accuracy & Pass Rate Metric Papers (88)
- Accuracy Metric Papers (82)
- Accuracy & Pass Rate Metric Papers In CS.CL (63)
- Accuracy & Pass Rate Metric Papers + Automatic Metrics (74)
- Accuracy In CS.CL Papers (58)
- Accuracy & Pass Rate Metric Papers In CS.AI (58)
- Accuracy + Automatic Metrics Metric Papers (70)
- Accuracy + Automatic Metrics Metric Papers (Last 120 Days) (53)
- Accuracy + Automatic Metrics Metric Papers (Last 90 Days) (51)
- Accuracy + Automatic Metrics Metric Papers (Last 30 Days) (47)
Need human evaluators for your AI research? Scale annotation with expert AI Trainers.