- TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0
Red Team · Automatic Metrics · Long Horizon
As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0
Automatic Metrics · Long Horizon
Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
- Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026 · Citations: 0
Automatic Metrics · Long Horizon
Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
- The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration
Haoyuan Xu, Chang Li, Xinyan Ma, Xianhao Ou, Zihan Zhang · Mar 24, 2026 · Citations: 0
Automatic Metrics · Tool Use
As agent systems evolve, however, the central problem has shifted from isolated invocation to multi-tool orchestration over long trajectories with intermediate state, execution feedback, changing environments, and practical constraints such…
- SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?
Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu · Feb 28, 2026 · Citations: 0
Automatic Metrics · Long Horizon
Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting and reusing higher-level tool…
- ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee · Feb 24, 2026 · Citations: 0
Automatic Metrics · Long Horizon
We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.
- WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing
Fanheng Kong, Jingyuan Zhang, Yang Yue, Chenxi Sun, Yang Tian · Mar 26, 2026 · Citations: 0
Long Horizon
To address these gaps, we introduce WebTestBench, a benchmark for evaluating end-to-end automated web testing.
- SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
Gabriel Orlanski, Devjeet Roy, Alexander Yun, Changho Shin, Alex Gu · Mar 25, 2026 · Citations: 0
Long Horizon
We introduce SlopCodeBench, a language-agnostic benchmark comprising 20 problems and 93 checkpoints, in which agents repeatedly extend their own prior solutions under evolving specifications that force architectural decisions without…
- Beyond Local Code Optimization: Multi-Agent Reasoning for Software System Optimization
Huiyun Peng, Parth Vinod Patil, Antonio Zhong Qiu, George K. Thiruvathukal, James C. Davis · Mar 16, 2026 · Citations: 0
Long Horizon
Large language models and AI agents have recently shown promise in automating software performance optimization, but existing approaches predominantly rely on local, syntax-driven code transformations.
- ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning
Juyong Jiang, Jiasi Shen, Sunghun Kim, Kang Min Yoo, Jeonghoon Kim · Mar 6, 2026 · Citations: 0
Long Horizon
Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder-8B establishes a new state-of-the-art (SOTA) among leading open-source models in the 1.5B-14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80%…
- SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training
Huatong Song, Lisheng Huang, Shuang Sun, Jinhao Jiang, Ran Le · Feb 3, 2026 · Citations: 0
Long Horizon
In this technical report, we present SWE-Master, an open-source and fully reproducible post-training framework for building effective software engineering agents.
- Process-Centric Analysis of Agentic Software Systems
Shuyang Liu, Yang Chen, Rahul Krishna, Saurabh Sinha, Jatin Ganhotra · Dec 2, 2025 · Citations: 0
Long Horizon
Agentic systems are modern software systems: they consist of orchestrated modules, expose interfaces, and are deployed in software pipelines.