- AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0
Demonstrations Human Eval Llm As Judge Long Horizon
LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
- TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0
Red Team Automatic Metrics Long Horizon
As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
- PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning
Bingxuan Li, Jeonghwan Kim, Cheng Qian, Xiusi Chen, Eitan Anzenberg · Jan 17, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Long Horizon
To enable a systematic study of this question, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution.
- LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation
Feiyu Duan, Xuanjing Huang, Zhongyu Wei · Mar 12, 2026 · Citations: 0
Pairwise Preference Simulation Env Long Horizon
However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states.
- SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart · Mar 30, 2026 · Citations: 0
Demonstrations Simulation Env Long Horizon
To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL.
- Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright · Mar 3, 2026 · Citations: 0
Pairwise Preference Rubric Rating Llm As Judge Simulation Env Long Horizon
Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
- Signals: Trajectory Sampling and Triage for Agentic Interactions
Shuguang Chen, Adil Hafeez, Salman Paracha · Apr 1, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Long Horizon
We propose a lightweight, signal-based framework for triaging agentic interaction trajectories.
- Learning When to Act: Interval-Aware Reinforcement Learning with Predictive Temporal Structure
Davide Di Gioia · Mar 23, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Long Horizon
Autonomous agents operating in continuous environments must decide not only what to do, but when to act.
- Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing
Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul · Mar 6, 2026 · Citations: 0
Human Eval Automatic Metrics Long Horizon
We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis…
- ReDAct: Uncertainty-Aware Deferral for LLM Agents
Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov · Apr 8, 2026 · Citations: 0
Simulation Env Long Horizon
Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems.
- Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026 · Citations: 0
Simulation Env Long Horizon
We propose GiG, a novel planning framework that structures embodied agents' memory using a Graph-in-Graph architecture.
- From Control to Foresight: Simulation as a New Paradigm for Human-Agent Collaboration
Gaole He, Brian Y. Lim · Mar 12, 2026 · Citations: 0
Pairwise Preference Simulation Env Long Horizon
Large Language Models (LLMs) are increasingly used to power autonomous agents for complex, multi-step tasks.
- DeceptGuard: A Constitutional Oversight Framework For Detecting Deception in LLM Agents
Snehasis Mukhopadhyay · Mar 14, 2026 · Citations: 0
Automatic Metrics Simulation Env Long Horizon
We introduce DECEPTGUARD, a unified framework that systematically compares three monitoring regimes: black-box monitors (actions and outputs only), CoT-aware monitors (additionally observing the agent's chain-of-thought reasoning trace),…
- FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
Michael Krumdick, Varshini Reddy, Shivani Chaudhary, William Day, Maarij Ahmed · Apr 7, 2026 · Citations: 0
Rubric Rating Long Horizon
To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task to complete.
- TREX: Trajectory Explanations for Multi-Objective Reinforcement Learning
Dilina Rajapakse, Juan C. Rosero, Ivana Dusparic · Mar 23, 2026 · Citations: 0
Pairwise Preference Long Horizon
Multi-Objective Reinforcement Learning (MORL) addresses this limitation by enabling agents to optimize several objectives simultaneously, explicitly reasoning about trade-offs between them.
- DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling
Shicheng Liu, Yucheng Jiang, Sajid Farook, Camila Nicollier Sanchez, David Fernando Castro Pena · Apr 7, 2026 · Citations: 0
Human Eval Long Horizon
Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis.
- Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai · Mar 30, 2026 · Citations: 0
Critique Edit Long Horizon
We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe.
- MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue
Naifan Zhang, Ruihan Sun, Jinwei Su, Hengjie Yang, Zhengyuan Pan · Mar 6, 2026 · Citations: 0
Llm As Judge Simulation Env Long Horizon
We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns.
- Mind the Shift: Decoding Monetary Policy Stance from FOMC Statements with Large Language Models
Yixuan Tang, Yi Yang · Mar 15, 2026 · Citations: 0
Llm As Judge Automatic Metrics Long Horizon
Across four LLM backbones, DCS consistently outperforms supervised probes and LLM-as-judge baselines, achieving up to 71.1% accuracy on sentence-level hawkish-dovish classification.
- SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray · Feb 24, 2026 · Citations: 0
Simulation Env Long Horizon
Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning.
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu · Feb 15, 2026 · Citations: 0
Simulation Env Long Horizon
The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…
- PMG: Parameterized Motion Generator for Human-like Locomotion Control
Chenxi Han, Yuheng Min, Zihao Huang, Ao Hong, Hang Liu · Feb 13, 2026 · Citations: 0
Automatic Metrics Long Horizon
Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain.
- Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models
Aryan Kasat, Smriti Singh, Aman Chadha, Vinija Jain · Mar 23, 2026 · Citations: 0
Llm As Judge Long Horizon
Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses from 13 LLMs spanning a range of architectures, parameter scales, and training regimes across six classical moral dilemmas, and…
- Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications
Che Chen, Lanhua Li, Shimin Gong, Yu Zhao, Yuming Fang · Mar 23, 2026 · Citations: 0
Simulation Env Long Horizon
To maximize the overall throughput, we first propose a delay-tolerant multi-agent deep reinforcement learning (MADRL) algorithm that integrates a delay-penalized reward to encourage information sharing among UAVs, while jointly optimizing…
- AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
Cheng Jiayang, Dongyu Ru, Lin Qiu, Yiyang Li, Xuezhi Cao · Mar 2, 2026 · Citations: 0
Simulation Env Long Horizon
Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory.
- Self-Debias: Self-correcting for Debiasing Large Language Models
Xuan Feng, Shuai Zhao, Luwei Xiao, Tianlong Gu, Bo An · Apr 9, 2026 · Citations: 0
Pairwise Preference Long Horizon
Unlike standard preference optimization which applies broad penalties, Self-Debias employs a fine-grained trajectory-level objective subject to dynamic debiasing constraints.
- Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
Shiwan Zhao, Zhihu Wang, Xuyang Zhao, Jiaming Zhou, Caiyue Xu · Apr 9, 2026 · Citations: 0
Pairwise Preference Long Horizon
Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines.
- AgenticRec: End-to-End Tool-Integrated Policy Optimization for Ranking-Oriented Recommender Agents
Tianyi Li, Zixuan Wang, Guidong Lei, Xiaodong Li, Hui Li · Mar 23, 2026 · Citations: 0
Pairwise Preference Tool Use
To address this, we present AgenticRec, a ranking-oriented agentic recommendation framework that optimizes the entire decision-making trajectory (including intermediate reasoning, tool invocation, and final ranking list generation) under…
- HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning
Zhicong Lu, Zichuan Lin, Wei Jia, Changyuan Tian, Deheng Ye · Mar 19, 2026 · Citations: 0
Pairwise Preference Long Horizon
While large language models excel in diverse domains, their performance on complex long-horizon agentic decision-making tasks remains limited.
- IROSA: Interactive Robot Skill Adaptation using Natural Language
Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp · Mar 4, 2026 · Citations: 0
Demonstrations Long Horizon
We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and…
- MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
Shu Wang, Edwin Yu, Oscar Love, Tom Zhang, Tom Wong · Apr 6, 2026 · Citations: 0
Automatic Metrics Long Horizon
Large Language Model (LLM) agents require persistent memory to maintain personalization, factual continuity, and long-horizon reasoning, yet standard context-window and retrieval-augmented generation (RAG) pipelines degrade over…
- OSCAR: Orchestrated Self-verification and Cross-path Refinement
Yash Shah, Abhijit Chakraborty, Naresh Kumar Devulapally, Vishnu Lokhande, Vivek Gupta · Apr 2, 2026 · Citations: 0
Automatic Metrics Long Horizon
We introduce a suite of trajectory-level assessments, including a cross-chain divergence-at-hallucination (CDH) metric, for principled comparison of localization methods.
- Asymmetric Actor-Critic for Multi-turn LLM Agents
Shuli Jiang, Zhaoyang Zhang, Yi Zhang, Shuo Yang, Wei Xia · Mar 31, 2026 · Citations: 0
Automatic Metrics Long Horizon
In many real-world applications, agents must succeed in one-shot settings where retries are impossible.
- EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises
Ankush Agarwal, Harsh Vishwakarma, Suraj Nagaje, Chaitanya Devaguptapu · Mar 23, 2026 · Citations: 0
Automatic Metrics Long Horizon
Deploying AI agents in enterprise environments requires balancing capability with data sovereignty and cost constraints.
- AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications.
- D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration --…
- MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation
Taolin Han, Shuang Wu, Jinghang Wang, Yuhao Zhou, Renquan Lv · Mar 26, 2026 · Citations: 0
Automatic Metrics Simulation Env Long Horizon
Current scientific evaluation benchmarks predominantly rely on static, single-turn Question Answering (QA) formats, which are inadequate for measuring model performance in complex scientific tasks that require multi-step iteration and…
- ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning
Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li · Feb 25, 2026 · Citations: 0
Simulation Env Long Horizon
Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks.
- Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026 · Citations: 0
Pairwise Preference Long Horizon
Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
- SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions
Saroj Mishra, Suman Niroula, Umesh Yadav, Dilip Thakur, Srijan Gyawali · Mar 7, 2026 · Citations: 0
Automatic Metrics Long Horizon
Retrieval-Augmented Generation (RAG) systems are increasingly evolving into agentic architectures where large language models autonomously coordinate multi-step reasoning, dynamic memory management, and iterative retrieval strategies.
- Contextualized Privacy Defense for LLM Agents
Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie · Mar 3, 2026 · Citations: 0
Simulation Env Long Horizon
LLM agents increasingly act on users' personal information, yet existing privacy defenses remain limited in both design and adaptability.
- LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding · Feb 25, 2026 · Citations: 0
Simulation Env Long Horizon
We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.
- TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation
Xinliang Frederick Zhang, Lu Wang · Apr 9, 2026 · Citations: 0
Pairwise Preference Long Horizon
Personalized large language models (PLLMs) have garnered significant attention for their ability to align outputs with individuals' needs and preferences.
- Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar · Feb 20, 2026 · Citations: 0
Pairwise Preference Long Horizon
When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.
- PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
Zhifei Xie, Zongzheng Hu, Fangda Ye, Xin Zhang, Haobo Chai · Apr 9, 2026 · Citations: 0
Automatic Metrics Long Horizon
Prior work remains largely confined to laboratory settings, leaving a clear gap for real-world proactive agents in depth, complexity, ambiguity, precision, and real-time constraints.
- Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency
Guan-Ting Lin, Chen Chen, Zhehuai Chen, Hung-yi Lee · Apr 6, 2026 · Citations: 0
Automatic Metrics Tool Use
We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use.
- YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi · Apr 1, 2026 · Citations: 0
Automatic Metrics Long Horizon
As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound.
- Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards
Cheng Jiayang, Xin Liu, Zhihan Zhang, Haoyang Wen, Zixuan Zhang · Mar 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
We present a framework addressing both challenges.
- DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao · Feb 27, 2026 · Citations: 0
Automatic Metrics Long Horizon
The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking.
- Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.
- LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning
Xinwu Ye, Yicheng Mao, Jia Zhang, Yimeng Liu, Li Hao · Feb 6, 2026 · Citations: 0
Automatic Metrics Long Horizon
Across diverse chemical reasoning benchmarks, LatentChem achieves a 59.88% non-tie win rate over strong CoT-based baselines on ChemCoTBench, while delivering a 10.84× average reduction in reasoning overhead.
- Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving
Yinan Zheng, Tianyi Tan, Bin Huang, Enguang Liu, Ruiming Liang · Feb 26, 2026 · Citations: 0
Simulation Env Long Horizon
However, their applications and evaluations in autonomous driving remain limited to simulation-based or laboratory settings.
- Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids
Victor Reijgwart, Cesar Cadena, Roland Siegwart, Lionel Ott · Feb 24, 2026 · Citations: 0
Simulation Env Long Horizon
Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information.
- From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents
Jiaxuan Gao, Jiaao Chen, Chuyi He, Shusheng Xu, Di Jin · Jan 30, 2026 · Citations: 0
Simulation Env Long Horizon
Interactive tool-using agents must solve real-world tasks via multi-turn interaction with both humans and external environments, requiring dialogue state tracking and multi-step tool execution while following complex instructions.
- LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications
Mayank Mayank, Bharanidhar Duraisamy, Florian Geiss · Apr 2, 2026 · Citations: 0
Automatic Metrics Long Horizon
Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset…
- AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents
Zhaopeng Feng, Liangcai Su, Zhen Zhang, Xinyu Wang, Xiaotian Zhang · Mar 29, 2026 · Citations: 0
Automatic Metrics Long Horizon
As large language models (LLMs) evolve into autonomous agents for long-horizon information-seeking, managing finite context capacity has become a critical bottleneck.
- S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation
Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava · Mar 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models.
- Sell Me This Stock: Unsafe Recommendation Drift in LLM Agents
Zekun Wu, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz · Mar 13, 2026 · Citations: 0
Automatic Metrics Long Horizon
Tool-augmented LLM agents increasingly operate as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking metrics that measure what is recommended but not whether it is safe for the user.
- Not All Queries Need Deep Thought: CoFiCot for Adaptive Coarse-to-fine Stateful Refinement
Dongxu Zhang, Hongqiang Lin, Yiding Sun, Pengyu Wang, Qirui Wang · Mar 9, 2026 · Citations: 0
Automatic Metrics Long Horizon
To address this, we propose CoFiCot, a coarse-to-fine adaptive framework that dynamically tailors inference strategies to problem difficulty.
- LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval
Jiajie Jin, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie · Mar 2, 2026 · Citations: 0
Automatic Metrics Long Horizon
Extensive experiments on both in-domain and out-of-domain reasoning-intensive benchmarks demonstrate that LaSER significantly outperforms state-of-the-art baselines.