- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
- Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA
Fengyu Li, Junhao Zhu, Kaishi Song, Lu Chen, Zhongming Yao · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, with 79\% table compression and a 2
- Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.
- Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s
- How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Latent reasoning has been recently proposed as a reasoning paradigm and performs multi-step reasoning through generating steps in the latent space instead of the textual space.
- GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
- D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces "overthinking" in Small Language Models (SLMs), leading to performance degradation and excessive token consumption.
- VecGlypher: Unified Vector Glyph Generation with Language Models
Xiaoke Huang, Bhavul Gauri, Kam Woh Ng, Tony Ng, Mengmeng Xu · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
On cross-family OOD evaluation, VecGlypher substantially outperforms both general-purpose LLMs and specialized vector-font baselines for text-only generation, while image-referenced generation reaches a state-of-the-art performance, with ma
- Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: \textit{reflection-in-action}, where the agent uses test-time scaling to generate and score multiple candidat
- SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning.
- ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee · Feb 24, 2026 · Citations: 0
Simulation Env Long Horizon
We introduce \ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.
- Semantic Novelty at Scale: Narrative Shape Taxonomy and Readership Prediction in 28,606 Books
W. Frederick Zimmerman · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
I introduce semantic novelty--cosine distance between each paragraph's sentence embedding and the running centroid of all preceding paragraphs--as an information-theoretic measure of narrative structure at corpus scale.
- GATES: Self-Distillation under Privileged Context with Consensus Gating
Alex Stein, Furong Huang, Tom Goldstein · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Held-out in-domain accuracy under asymmetric evaluation improves from 46.0\% to 62.0\%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2\% to 35.4\%.
- Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song · Feb 23, 2026 · Citations: 0
Automatic Metrics Long Horizon
We introduce \CFE{} (\textbf{C}lassroom \textbf{F}inal \textbf{E}xam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains.
- Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations
Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore · Feb 22, 2026 · Citations: 0
Automatic Metrics Long Horizon
Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.
- Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Long Horizon
Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
- VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva, David Semedo, João Maglhães · Feb 22, 2026 · Citations: 0
Automatic Metrics Long Horizon
Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90\% accuracy on plan-aware VQA.
- AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
Qijie You, Wenkai Yu, Wentao Zhang · Feb 22, 2026 · Citations: 0
Automatic Metrics Long Horizon
With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction.
- Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026 · Citations: 0
Automatic Metrics Long Horizon
Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.
- Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks
Wilson Y. Lee · Feb 22, 2026 · Citations: 0
Automatic Metrics Long Horizon
Why do language agents fail on tasks they are capable of solving?
- Watermarking LLM Agent Trajectories
Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li · Feb 21, 2026 · Citations: 0
Automatic Metrics Long Horizon
LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering.
- Semantic Substrate Theory: An Operator-Theoretic Framework for Geometric Semantic Drift
Stephen Russell · Feb 21, 2026 · Citations: 0
Automatic Metrics Long Horizon
Most semantic drift studies report multiple signals e.g., embedding displacement, neighbor changes, distributional divergence, and recursive trajectory instability, without a shared explanatory theory that relates them.
- Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar · Feb 20, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Long Horizon
When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.
- Sink-Aware Pruning for Diffusion Language Models
Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen · Feb 19, 2026 · Citations: 0
Automatic Metrics Long Horizon
Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning.
- KLong: Training LLM Agent for Extremely Long-horizon Tasks
Yue Liu, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi · Feb 19, 2026 · Citations: 0
Rubric Rating Automatic Metrics Long Horizon
This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks.
- BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
Yunseung Lee, Subin Kim, Youngjun Kwak, Jaegul Choo · Feb 19, 2026 · Citations: 0
Automatic Metrics Long Horizon
However, such errors have rarely been captured by existing benchmarks.
- Large Language Models Persuade Without Planning Theory of Mind
Jared Moore, Rasmus Overmark, Ned Cooper, Beba Cibralic, Nick Haber · Feb 19, 2026 · Citations: 0
Automatic Metrics Long Horizon
A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks.
- Creating a digital poet
Vered Tohar, Tsahi Hayat, Amir Leshem · Feb 18, 2026 · Citations: 0
Automatic Metrics Long Horizon
In a blinded authorship test with 50 humanities students and graduates (three AI poems and three poems by well-known poets each), judgments were at chance: human poems were labeled human 54% of the time and AI poems 52%, with 95% confidence
- TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers
Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif · Feb 18, 2026 · Citations: 0
Automatic Metrics Long Horizon
Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, gating,
- GLM-5: from Vibe Coding to Agentic Engineering
GLM-5-Team, :, Aohan Zeng, Xin Lv, Zhenyu Hou · Feb 17, 2026 · Citations: 0
Automatic Metrics Long Horizon
We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering.
- FrameRef: A Framing Dataset and Simulation Testbed for Modeling Bounded Rational Information Health
Victor De Lima, Jiqun Liu, Grace Hui Yang · Feb 17, 2026 · Citations: 0
Human EvalSimulation Env Long Horizon
Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence.
- OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction
Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape · Feb 16, 2026 · Citations: 0
Simulation Env Tool Use
Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks.
- A Geometric Analysis of Small-sized Language Model Hallucinations
Emanuele Ricco, Elia Onofri, Lorenzo Cima, Stefano Cresci, Roberto Di Pietro · Feb 16, 2026 · Citations: 0
Automatic Metrics Long Horizon
Hallucinations -- fluent but factually incorrect responses -- pose a major challenge to the reliability of language models, especially in multi-step or agentic settings.
- Unlocking Reasoning Capability on Machine Translation in Large Language Models
Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio · Feb 16, 2026 · Citations: 0
Critique Edit Automatic Metrics Long Horizon
We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.
- TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Heiko Ludwig, Holger Boche · Feb 12, 2026 · Citations: 0
Simulation Env Long Horizon
Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks.
- UI-Venus-1.5 Technical Report
Venus Team, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu · Feb 9, 2026 · Citations: 0
Simulation Env Long Horizon
GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging.
- OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering
Yifan Zhu, Xinyu Mu, Tao Feng, Zhonghong Ou, Yuning Gong · Feb 3, 2026 · Citations: 0
Automatic Metrics Tool Use
To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning.
- SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training
Huatong Song, Lisheng Huang, Shuang Sun, Jinhao Jiang, Ran Le · Feb 3, 2026 · Citations: 0
Simulation Env Long Horizon
In this technical report, we present SWE-Master, an open-source and fully reproducible post-training framework for building effective software engineering agents.
- What If We Allocate Test-Time Compute Adaptively?
Ahsan Bilal, Ahmed Mohsin, Muhammad Umer, Ali Subhan, Hassan Rizwan · Feb 1, 2026 · Citations: 0
Automatic Metrics Long Horizon
For each problem, the agent runs multiple inference iterations.
- Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026 · Citations: 0
Simulation Env Long Horizon
While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.
- Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization
Jingyi Xu, Xingyu Ren, Zhoupeng Shou, Yumeng Zhang, Zhiqiang You · Jan 24, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Long Horizon
Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success.
- APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0
Rubric RatingExpert Verification Simulation Env Long Horizon
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate law
- Towards Efficient Agents: A Co-Design of Inference Architecture and System
Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu · Dec 20, 2025 · Citations: 0
Automatic Metrics Long Horizon
The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making.
- Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors
Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025 · Citations: 0
Automatic Metrics Long Horizon
We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.
- Stabilizing Off-Policy Training for Long-Horizon LLM Agent via Turn-Level Importance Sampling and Clipping-Triggered Normalization
Chenliang Li, Adel Elmahdy, Alex Boyd, Zhongruo Wang, Siliang Zeng · Nov 25, 2025 · Citations: 0
Automatic Metrics Long Horizon
Reinforcement learning (RL) algorithms such as PPO and GRPO are widely used to train large language models (LLMs) for multi-turn agentic tasks.
- Bridging Symbolic Control and Neural Reasoning in LLM Agents: Structured Cognitive Loop with a Governance Layer
Myung Ho Kim · Nov 21, 2025 · Citations: 0
Automatic Metrics Long Horizon
Large language model agents suffer from fundamental architectural problems: entangled reasoning and execution, memory volatility, and uncontrolled action sequences.
- Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces
Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury · Nov 10, 2025 · Citations: 0
Automatic Metrics Long Horizon
On the Episodic Memory Benchmark (EpBench) \cite{huet_episodic_2025} comprising corpora ranging from 100k to 1M tokens in length, GSW outperforms existing RAG based baselines by up to \textbf{20\%}.
- BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025 · Citations: 0
Pairwise Preference Automatic MetricsSimulation Env Long Horizon
Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs.
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han · Oct 29, 2025 · Citations: 0
Demonstrations Automatic Metrics Long Horizon
Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
- The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu · Oct 29, 2025 · Citations: 0
Simulation Env Long Horizon
Real-world language agents must handle complex, multi-step workflows across diverse Apps.
- RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025 · Citations: 0
Automatic Metrics Long Horizon
A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); Finally the head agent composes can
- PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation
Xiangjun Zai, Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu · Oct 14, 2025 · Citations: 0
Automatic Metrics Long Horizon
Experiments across multiple domains demonstrate that PRoH achieves state-of-the-art performance, surpassing the prior SOTA model HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score, while maintaining s
- RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility
Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang · Sep 27, 2025 · Citations: 0
Automatic Metrics Long Horizon
Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors.
- CoAct-1: Computer-using Multi-Agent System with Coding Actions
Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi · Aug 5, 2025 · Citations: 0
Automatic Metrics Long Horizon
Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks.
- iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering
Shuai Wang, Yinan Yu · Jun 2, 2025 · Citations: 0
Automatic Metrics Long Horizon
Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.
- Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation
Safeyah Khaled Alshemali, Daniel Bauer, Yuval Marton · Oct 19, 2024 · Citations: 0
Automatic Metrics Long Horizon
We set a new state-of-the-art on thematic fit benchmarks, but show that closed and open weight LLMs respond differently to our prompting strategies: Closed models achieve better scores overall and benefit from multi-step reasoning, but they
- MathScape: Benchmarking Multimodal Large Language Models in Real-World Mathematical Contexts
Hao Liang, Linzhuang Sun, Minxuan Zhou, Zirong Chen, Meiyi Qiang · Aug 14, 2024 · Citations: 0
Automatic Metrics Long Horizon
While existing benchmarks such as MathVista and MathVerse have advanced the evaluation of multimodal math proficiency, they primarily rely on digitally rendered content and fall short in capturing the complexity of real-world scenarios.