- AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0
Demonstrations Human EvalLlm As Judge Long Horizon
LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
- LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation
Feiyu Duan, Xuanjing Huang, Zhongyu Wei · Mar 12, 2026 · Citations: 0
Pairwise Preference Simulation Env Long Horizon
However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states.
- Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026 · Citations: 0
Pairwise Preference Simulation Env Long Horizon
Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie
- SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart · Mar 30, 2026 · Citations: 0
Demonstrations Simulation Env Long Horizon
To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL.
- Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright · Mar 3, 2026 · Citations: 0
Pairwise PreferenceRubric Rating Llm As JudgeSimulation Env Long Horizon
Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
- RAPTOR: A Foundation Policy for Quadrotor Control
Jonas Eschmann, Dario Albani, Giuseppe Loianno · Sep 15, 2025 · Citations: 0
Demonstrations Simulation Env Long Horizon
Humans are remarkably data-efficient when adapting to new unseen conditions, like driving a new car.
- ReDAct: Uncertainty-Aware Deferral for LLM Agents
Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov · Apr 8, 2026 · Citations: 0
Simulation Env Long Horizon
Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems.
- Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026 · Citations: 0
Simulation Env Long Horizon
We propose GiG, a novel planning framework that structures embodied agents' memory using a Graph-in-Graph architecture.
- From Control to Foresight: Simulation as a New Paradigm for Human-Agent Collaboration
Gaole He, Brian Y. Lim · Mar 12, 2026 · Citations: 0
Pairwise Preference Simulation Env Long Horizon
Large Language Models (LLMs) are increasingly used to power autonomous agents for complex, multi-step tasks.
- MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang · Oct 21, 2025 · Citations: 0
Demonstrations Simulation Env Long Horizon
Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
- DeceptGuard :A Constitutional Oversight Framework For Detecting Deception in LLM Agents
Snehasis Mukhopadhyay · Mar 14, 2026 · Citations: 0
Automatic MetricsSimulation Env Long Horizon
We introduce DECEPTGUARD, a unified framework that systematically compares three monitoring regimes: black-box monitors (actions and outputs only), CoT-aware monitors (additionally observing the agent's chain-of-thought reasoning trace),…
- Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang · Jan 15, 2026 · Citations: 0
Simulation Env Long Horizon
The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanni
- MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue
Naifan Zhang, Ruihan Sun, Jinwei Su, Hengjie Yang, Zhengyuan Pan · Mar 6, 2026 · Citations: 0
Llm As JudgeSimulation Env Long Horizon
We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns.
- SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray · Feb 24, 2026 · Citations: 0
Simulation Env Long Horizon
Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning.
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu · Feb 15, 2026 · Citations: 0
Simulation Env Long Horizon
The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…
- Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications
Che Chen, Lanhua Li, Shimin Gong, Yu Zhao, Yuming Fang · Mar 23, 2026 · Citations: 0
Simulation Env Long Horizon
To maximize the overall throughput, we first propose a delay-tolerant multi-agent deep reinforcement learning (MADRL) algorithm that integrates a delay-penalized reward to encourage information sharing among UAVs, while jointly optimizing…
- AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
Cheng Jiayang, Dongyu Ru, Lin Qiu, Yiyang Li, Xuezhi Cao · Mar 2, 2026 · Citations: 0
Simulation Env Long Horizon
Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory.
- Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu · Dec 9, 2025 · Citations: 0
Simulation Env Long Horizon
Extensive experiments on the AerialVLN and OpenFly benchmark validate the effectiveness of our method.
- MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation
Taolin Han, Shuang Wu, Jinghang Wang, Yuhao Zhou, Renquan Lv · Mar 26, 2026 · Citations: 0
Automatic MetricsSimulation Env Long Horizon
Current scientific evaluation benchmarks predominantly rely on static, single-turn Question Answering (QA) formats, which are inadequate for measuring model performance in complex scientific tasks that require multi-step iteration and…
- ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning
Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li · Feb 25, 2026 · Citations: 0
Simulation Env Long Horizon
Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks.
- R-WoM: Retrieval-augmented World Model For Computer-use Agents
Kai Mei, Jiang Guo, Shuaichen Chang, Mingwen Dong, Dongkyu Lee · Oct 13, 2025 · Citations: 0
Simulation Env Long Horizon
Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration.
- Contextualized Privacy Defense for LLM Agents
Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie · Mar 3, 2026 · Citations: 0
Simulation Env Long Horizon
LLM agents increasingly act on users' personal information, yet existing privacy defenses remain limited in both design and adaptability.
- LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding · Feb 25, 2026 · Citations: 0
Simulation Env Long Horizon
We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.
- A Survey on the Optimization of Large Language Model-based Agents
Shangheng Du, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xin Jiang · Mar 16, 2025 · Citations: 0
Simulation Env Long Horizon
With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks.
- Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving
Yinan Zheng, Tianyi Tan, Bin Huang, Enguang Liu, Ruiming Liang · Feb 26, 2026 · Citations: 0
Simulation Env Long Horizon
However, their applications and evaluations in autonomous driving remain limited to simulation-based or laboratory settings.
- Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids
Victor Reijgwart, Cesar Cadena, Roland Siegwart, Lionel Ott · Feb 24, 2026 · Citations: 0
Simulation Env Long Horizon
Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information.
- From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents
Jiaxuan Gao, Jiaao Chen, Chuyi He, Shusheng Xu, Di Jin · Jan 30, 2026 · Citations: 0
Simulation Env Long Horizon
Interactive tool-using agents must solve real-world tasks via multi-turn interaction with both humans and external environments, requiring dialogue state tracking, multi-step tool execution, while following complex instructions.
- Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang · Apr 9, 2026 · Citations: 0
Simulation Env Long Horizon
However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior.
- BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format
Roland Pihlakas, Sruthi Susan Kuriakose · Sep 2, 2025 · Citations: 0
Simulation Env Long Horizon
Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., "paperclip maximiser", specification gaming) at the expense of everything else.