- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
- How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Latent reasoning has been recently proposed as a reasoning paradigm and performs multi-step reasoning through generating steps in the latent space instead of the textual space.
- GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
- Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li · Feb 25, 2026 · Citations: 0
Simulation Env Long Horizon
Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.
- Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
- ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning
Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li · Feb 25, 2026 · Citations: 0
Simulation Env Long Horizon
Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks.
- LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding · Feb 25, 2026 · Citations: 0
Simulation Env Long Horizon
We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.
- Provably Safe Generative Sampling with Constricting Barrier Functions
Darshan Gadginmath, Ahmed Allibhoy, Fabio Pasqualetti · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
However, a critical gap remains for their deployment in safety-critical domains: the lack of formal guarantees that generated samples will satisfy hard constraints.
- A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives
Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.
- Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: \textit{reflection-in-action}, where the agent uses test-time scaling to generate and score multiple candidat
- Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids
Victor Reijgwart, Cesar Cadena, Roland Siegwart, Lionel Ott · Feb 24, 2026 · Citations: 0
Simulation Env Long Horizon
Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information.
- SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang · Feb 24, 2026 · Citations: 0
Simulation Env Tool Use
Agentic systems increasingly rely on reusable procedural capabilities, \textit{a.k.a., agentic skills}, to execute long-horizon workflows reliably.
- ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction
Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution.
- Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics
Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li · Feb 23, 2026 · Citations: 0
Automatic Metrics Long Horizon
The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings.
- Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song · Feb 23, 2026 · Citations: 0
Automatic Metrics Long Horizon
We introduce \CFE{} (\textbf{C}lassroom \textbf{F}inal \textbf{E}xam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains.
- Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations
Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore · Feb 22, 2026 · Citations: 0
Automatic Metrics Long Horizon
Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.
- Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Long Horizon
Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
- Semantic Substrate Theory: An Operator-Theoretic Framework for Geometric Semantic Drift
Stephen Russell · Feb 21, 2026 · Citations: 0
Automatic Metrics Long Horizon
Most semantic drift studies report multiple signals e.g., embedding displacement, neighbor changes, distributional divergence, and recursive trajectory instability, without a shared explanatory theory that relates them.
- Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar · Feb 20, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Long Horizon
When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.
- Sink-Aware Pruning for Diffusion Language Models
Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen · Feb 19, 2026 · Citations: 0
Automatic Metrics Long Horizon
Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning.
- KLong: Training LLM Agent for Extremely Long-horizon Tasks
Yue Liu, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi · Feb 19, 2026 · Citations: 0
Rubric Rating Automatic Metrics Long Horizon
This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks.
- Creating a digital poet
Vered Tohar, Tsahi Hayat, Amir Leshem · Feb 18, 2026 · Citations: 0
Automatic Metrics Long Horizon
In a blinded authorship test with 50 humanities students and graduates (three AI poems and three poems by well-known poets each), judgments were at chance: human poems were labeled human 54% of the time and AI poems 52%, with 95% confidence
- OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction
Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape · Feb 16, 2026 · Citations: 0
Simulation Env Tool Use
Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks.
- A Geometric Analysis of Small-sized Language Model Hallucinations
Emanuele Ricco, Elia Onofri, Lorenzo Cima, Stefano Cresci, Roberto Di Pietro · Feb 16, 2026 · Citations: 0
Automatic Metrics Long Horizon
Hallucinations -- fluent but factually incorrect responses -- pose a major challenge to the reliability of language models, especially in multi-step or agentic settings.
- Unlocking Reasoning Capability on Machine Translation in Large Language Models
Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio · Feb 16, 2026 · Citations: 0
Critique Edit Automatic Metrics Long Horizon
We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.
- PMG: Parameterized Motion Generator for Human-like Locomotion Control
Chenxi Han, Yuheng Min, Zihao Huang, Ao Hong, Hang Liu · Feb 13, 2026 · Citations: 0
Automatic Metrics Long Horizon
Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain.
- Think like a Scientist: Physics-guided LLM Agent for Equation Discovery
Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, Rose Yu · Feb 12, 2026 · Citations: 0
Automatic Metrics Long Horizon
We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process.
- TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Heiko Ludwig, Holger Boche · Feb 12, 2026 · Citations: 0
Simulation Env Long Horizon
Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks.
- UI-Venus-1.5 Technical Report
Venus Team, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu · Feb 9, 2026 · Citations: 0
Simulation Env Long Horizon
GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging.
- AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen · Feb 8, 2026 · Citations: 0
Automatic Metrics Long Horizon
Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons.
- Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization
Jingyi Xu, Xingyu Ren, Zhoupeng Shou, Yumeng Zhang, Zhiqiang You · Jan 24, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Long Horizon
Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success.
- APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0
Rubric RatingExpert Verification Simulation Env Long Horizon
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate law
- Orthogonalized Policy Optimization:Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0
Automatic Metrics Long Horizon
Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
- Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang · Jan 15, 2026 · Citations: 0
Simulation Env Long Horizon
The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanni
- Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026 · Citations: 0
Pairwise Preference Simulation Env Long Horizon
Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie
- Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu · Dec 9, 2025 · Citations: 0
Simulation Env Long Horizon
Extensive experiments on the AerialVLN and OpenFly benchmark validate the effectiveness of our method.
- Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors
Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025 · Citations: 0
Automatic Metrics Long Horizon
We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.
- Stabilizing Off-Policy Training for Long-Horizon LLM Agent via Turn-Level Importance Sampling and Clipping-Triggered Normalization
Chenliang Li, Adel Elmahdy, Alex Boyd, Zhongruo Wang, Siliang Zeng · Nov 25, 2025 · Citations: 0
Automatic Metrics Long Horizon
Reinforcement learning (RL) algorithms such as PPO and GRPO are widely used to train large language models (LLMs) for multi-turn agentic tasks.
- Bridging Symbolic Control and Neural Reasoning in LLM Agents: Structured Cognitive Loop with a Governance Layer
Myung Ho Kim · Nov 21, 2025 · Citations: 0
Automatic Metrics Long Horizon
Large language model agents suffer from fundamental architectural problems: entangled reasoning and execution, memory volatility, and uncontrolled action sequences.
- Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces
Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury · Nov 10, 2025 · Citations: 0
Automatic Metrics Long Horizon
On the Episodic Memory Benchmark (EpBench) \cite{huet_episodic_2025} comprising corpora ranging from 100k to 1M tokens in length, GSW outperforms existing RAG based baselines by up to \textbf{20\%}.
- BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025 · Citations: 0
Pairwise Preference Automatic MetricsSimulation Env Long Horizon
Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs.
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han · Oct 29, 2025 · Citations: 0
Demonstrations Automatic Metrics Long Horizon
Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
- The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu · Oct 29, 2025 · Citations: 0
Simulation Env Long Horizon
Real-world language agents must handle complex, multi-step workflows across diverse Apps.
- World Simulation with Video Foundation Models for Physical AI
NVIDIA, :, Arslan Ali, Junjie Bai, Maciej Bala · Oct 28, 2025 · Citations: 0
Simulation Env Long Horizon
These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
- RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025 · Citations: 0
Automatic Metrics Long Horizon
A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); Finally the head agent composes can
- MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang · Oct 21, 2025 · Citations: 0
Demonstrations Simulation Env Long Horizon
Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
- RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility
Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang · Sep 27, 2025 · Citations: 0
Automatic Metrics Long Horizon
Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors.
- Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models
Yunqing Liu, Nan Zhang, Zhiming Tan · Sep 1, 2025 · Citations: 0
Pairwise Preference Automatic Metrics Long Horizon
We additionally contribute a CAD dataset with human preference annotations.
- EO-1: An Open Unified Embodied Foundation Model for General Robot Control
Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao · Aug 28, 2025 · Citations: 0
Automatic Metrics Long Horizon
The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general purpose embodied intelligent systems.
- Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee · Aug 26, 2025 · Citations: 0
Automatic Metrics Long Horizon
Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval.
- iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering
Shuai Wang, Yinan Yu · Jun 2, 2025 · Citations: 0
Automatic Metrics Long Horizon
Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.
- A Survey on the Optimization of Large Language Model-based Agents
Shangheng Du, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xin Jiang · Mar 16, 2025 · Citations: 0
Simulation Env Long Horizon
With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks.
- Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation
Safeyah Khaled Alshemali, Daniel Bauer, Yuval Marton · Oct 19, 2024 · Citations: 0
Automatic Metrics Long Horizon
We set a new state-of-the-art on thematic fit benchmarks, but show that closed and open weight LLMs respond differently to our prompting strategies: Closed models achieve better scores overall and benefit from multi-step reasoning, but they