Skip to content
← Back to explorer

Tag: Long Horizon

Measures multi-step performance across long-horizon tasks.

Papers in tag: 74

Research Utility Snapshot

Evaluation Modes

  • Automatic Metrics (14)
  • Simulation Env (6)

Human Feedback Types

  • Insufficient signal

Required Expertise

  • General (11)
  • Coding (7)
  • Math (2)
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng · Feb 25, 2026 · Citations: 0

Automatic Metrics Coding
  • Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
  • This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents.
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning

Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0

Automatic Metrics General
  • We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
  • When plans fail, the system applies TextGrad-inspired textual-gradient updates to optimize each agent's prompt and thereby improve planning accuracy.
Self-Correcting VLA: Online Action Refinement via Sparse World Imagination

Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li, Guoli Yang · Feb 25, 2026 · Citations: 0

Simulation Env Coding
  • Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.
  • Evaluations on challenging robot manipulation tasks from simulation benchmarks and real-world settings demonstrate that SC-VLA achieve state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher s
Structurally Aligned Subtask-Level Memory for Software Engineering Agents

Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026 · Citations: 0

Automatic Metrics Coding
  • Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
  • Recent work has further explored augmenting these agents with memory mechanisms to support long-horizon reasoning.
ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li, Kaiqiao Han · Feb 25, 2026 · Citations: 0

Simulation Env General
  • Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks.
  • Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL.
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding, Gedas Bertasius · Feb 25, 2026 · Citations: 0

Simulation Env General
  • We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.
  • Furthermore, real-world evaluations across 8 long-horizon tasks demonstrate an average success rate of 85%.
VecGlypher: Unified Vector Glyph Generation with Language Models

Xiaoke Huang, Bhavul Gauri, Kam Woh Ng, Tony Ng, Mengmeng Xu, Zhiheng Liu · Feb 25, 2026 · Citations: 0

Automatic Metrics General
  • On cross-family OOD evaluation, VecGlypher substantially outperforms both general-purpose LLMs and specialized vector-font baselines for text-only generation, while image-referenced generation reaches a state-of-the-art performance, with ma
Provably Safe Generative Sampling with Constricting Barrier Functions

Darshan Gadginmath, Ahmed Allibhoy, Fabio Pasqualetti · Feb 24, 2026 · Citations: 0

Automatic Metrics General
  • However, a critical gap remains for their deployment in safety-critical domains: the lack of formal guarantees that generated samples will satisfy hard constraints.
  • We address this by proposing a safety filtering framework that acts as an online shield for any pre-trained generative model.
A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives

Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung, Nikolay Koldunov · Feb 24, 2026 · Citations: 0

Automatic Metrics Coding
  • Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.
  • Unlike standard Large Language Model (LLM) wrappers, our architecture implements a centralized Supervisor-Worker topology with strict data-type-aware routing, sandboxed deterministic code execution, and self-correction via execution feedbac
Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi · Feb 24, 2026 · Citations: 0

Automatic Metrics General
  • Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: \textit{reflection-in-action}, where the agent uses test-time scaling to generate and score multiple candidat
  • We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment.
SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards

Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei · Feb 24, 2026 · Citations: 0

Automatic Metrics General
  • Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning.
  • We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design.
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang · Feb 24, 2026 · Citations: 0

Simulation Env LawCoding
  • Agentic systems increasingly rely on reusable procedural capabilities, \textit{a.k.a., agentic skills}, to execute long-horizon workflows reliably.
  • This paper maps the skill layer across the full lifecycle (discovery, practice, distillation, storage, composition, evaluation, and update) and introduces two complementary taxonomies.
ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning

Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee · Feb 24, 2026 · Citations: 0

Simulation Env Math
  • We introduce \ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.
  • It turns math problems into a controlled, correctness-checkable benchmark with tool sets, enabling systematic evaluation of model reliability under (1) large, overlapping tool catalogs and (2) the absence of the intended capability.
ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang, Longtao Huang · Feb 24, 2026 · Citations: 0

Automatic Metrics General
  • Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution.
  • Existing defenses typically rely on strict filtering or refusal mechanisms, which suffer from a critical limitation: over-refusal, prematurely terminating valid agentic workflows.
Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics

Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li, Gang Yang · Feb 23, 2026 · Citations: 0

Automatic Metrics MedicineCoding
  • The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings.