- AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0
Demonstrations Human EvalLlm As Judge
LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
- Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification
Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav · Jul 15, 2025 · Citations: 0
Pairwise Preference Automatic MetricsSimulation Env
We evaluate MLLM verifiers across web navigation, computer use, and robotics, spanning 13+ models, 28+ designs, and thousands of trajectories from diverse agents.
- When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou · Apr 1, 2026 · Citations: 0
Critique Edit Simulation Env
As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirement or revising goals, during mid-task execution…
- ReDAct: Uncertainty-Aware Deferral for LLM Agents
Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov · Apr 8, 2026 · Citations: 0
Simulation Env
Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems.
- Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026 · Citations: 0
Simulation Env
We propose GiG, a novel planning framework that structures embodied agents' memory using a Graph-in-Graph architecture.
- Go-Browse: Training Web Agents with Structured Exploration
Apurva Gandhi, Graham Neubig · Jun 4, 2025 · Citations: 0
Simulation Env
To address this, we propose Go-Browse, a method for automatically collecting diverse and realistic web agent data at scale through structured exploration of web environments.
- BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Shan · Feb 13, 2026 · Citations: 0
Automatic MetricsSimulation Env
Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments.
- Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0
Demonstrations Simulation Env
Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
- World-Model-Augmented Web Agents with Action Correction
Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li · Feb 17, 2026 · Citations: 0
Llm As JudgeSimulation Env
To address these challenges, we propose WAC, a web agent that integrates model collaboration, consequence simulation, and feedback-driven action refinement.
- Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks
Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng · Feb 26, 2026 · Citations: 0
Simulation Env
Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks.
- SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray · Feb 24, 2026 · Citations: 0
Simulation Env
Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning.
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu · Feb 15, 2026 · Citations: 0
Simulation Env
The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…
- R-WoM: Retrieval-augmented World Model For Computer-use Agents
Kai Mei, Jiang Guo, Shuaichen Chang, Mingwen Dong, Dongkyu Lee · Oct 13, 2025 · Citations: 0
Simulation Env
Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration.
- Reward Prediction with Factorized World States
Yijun Shen, Delong Chen, Xianming Hu, Jiaming Mi, Hongbo Zhao · Mar 10, 2026 · Citations: 0
Llm As JudgeSimulation Env
We evaluate on RewardPrediction, a new benchmark dataset spanning five diverse domains and comprising 2,454 unique action-observation trajectories with step-wise ground-truth rewards.