- AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0
Demonstrations · Human Eval · LLM As Judge
LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
- When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou · Apr 1, 2026 · Citations: 0
Critique Edit · Simulation Env
As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirements or revising goals, during mid-task execution…
- Go-Browse: Training Web Agents with Structured Exploration
Apurva Gandhi, Graham Neubig · Jun 4, 2025 · Citations: 0
Simulation Env
To address this, we propose Go-Browse, a method for automatically collecting diverse and realistic web agent data at scale through structured exploration of web environments.
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu · Feb 15, 2026 · Citations: 0
Simulation Env
The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…
- Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
Xuan Qi · Apr 2, 2026 · Citations: 0
Automatic Metrics
Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood.
- R-WoM: Retrieval-augmented World Model For Computer-use Agents
Kai Mei, Jiang Guo, Shuaichen Chang, Mingwen Dong, Dongkyu Lee · Oct 13, 2025 · Citations: 0
Simulation Env
Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration.
- SkillX: Automatically Constructing Skill Knowledge Bases for Agents
Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang · Apr 6, 2026 · Citations: 0
Automatic Metrics
Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient: agents learn in isolation, repeatedly rediscover similar behaviors from limited…
- The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check
Qingyu Lu, Liang Ding, Kanjian Zhang, Jinxia Zhang, Dacheng Tao · Jan 19, 2026 · Citations: 0
Automatic Metrics
In this work, we present a comprehensive evaluation of dLLMs (e.g., LLaDA, Dream) across two distinct agentic paradigms: Embodied Agents (requiring long-horizon planning) and Tool-Calling Agents (requiring precise formatting).
- Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions
Junhao Su, Yuanliang Wan, Junwei Yang, Hengyu Shi, Tianyang Han · Sep 23, 2025 · Citations: 0
Automatic Metrics
The agent produces a short yet precise reflection: it diagnoses the failure using evidence from the previous step and then proposes a correct, executable follow-up call.
- Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models
Yannis Belkhiter, Giulio Zizzo, Sergio Maffeis, Seshu Tirupathi, John D. Kelleher · Apr 22, 2026 · Citations: 0
- CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution
Shidong Yang, Ziyu Ma, Tongwen Huang, Yiming Hu, Yong Wang · Apr 17, 2026 · Citations: 0
- WebXSkill: Skill Learning for Autonomous Web Agents
Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wenlin Yao · Apr 14, 2026 · Citations: 0
- Awakening the Sleeping Agent: Lean-Specific Agentic Data Reactivates General Tool Use in Goedel Prover
Jui-Hui Chung, Hongzhou Lin, Lai Jiang, Shange Tang, Chi Jin · Apr 9, 2026 · Citations: 0
- Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies
Zhanzhi Lou, Hui Chen, Yibo Li, Qian Wang, Bryan Hooi · Apr 1, 2026 · Citations: 0
- AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation
Liang Ding · Mar 22, 2026 · Citations: 0
- AI Planning Framework for LLM-Based Web Agents
Orit Shahnovsky, Rotem Dror · Mar 13, 2026 · Citations: 0
- PostTrainBench: Can LLM Agents Automate LLM Post-Training?
Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen · Mar 9, 2026 · Citations: 0
- WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
Yao Zhang, Shijie Tang, Zeyu Li, Zhen Han, Volker Tresp · Jan 29, 2026 · Citations: 0
- Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents
Kaiyu Zhou, Yongsen Zheng, Yicheng He, Meng Xue, Xueluan Gong · Jan 16, 2026 · Citations: 0
- Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution
Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu · Dec 11, 2025 · Citations: 0