Skip to content

Latent-Condensed Transformer for Efficient Long Context Modeling

Zeng You, Yaofo Chen, Qiuwu Chen, Ying Sun, Shuhai Zhang, +3 more

2026-04-14T08:40:31Z

Abstract

Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately: Multi-head Latent Attention (MLA) reduces the KV cache by projecting tokens into a low-dimensional latent space, while sparse attention reduces computation. However, sparse methods cannot operate natively on MLA's compressed latent structure, missing opportunities for joint optimization. In this paper, we propose Latent-Condensed Attention (LCA), which directly condenses context within MLA's latent space, where the representation is disentangled into semantic latent vectors and positional keys. LCA separately aggregates semantic vectors via query-aware pooling and preserves positional keys via anchor selection. This approach jointly reduces both computational cost and KV cache without adding parameters. Beyond MLA, LCA's design is architecture-agnostic and readily extends to other attention mechanisms such as GQA. Theoretically, we prove a length-independent error bound. Experiments show LCA achieves up to 2.5$\times$ prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance.

Full analysis loading… Code implementations, benchmark data, and reproduction guides are being assembled. Please check back shortly.

Browse all papers

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.