When Can Transformers Count to n?
Gilad Yehudai, Haim Kaplan, Guy Dar, Royi Rassin, Asma Ghandeharioun, Mor Geva, Amir Globerson · Jul 21, 2024 · Citations: 0
Abstract
Large language models based on the transformer architecture can solve highly complex tasks, yet their fundamental limitations on simple algorithmic problems remain poorly understood. In this work, we focus on basic counting tasks and investigate how the difficulty of these tasks scales with the transformer embedding dimension, the context length, and the vocabulary size. We reveal a sharp theoretical phase transition governed by the relationship between the embedding dimension and the vocabulary size. When the dimension is at least as large as the vocabulary, transformers can perfectly maintain token counts. However, when the vocabulary exceeds the embedding dimension, the interference between non-orthogonal token representations forces the network weights to scale polynomially. This renders the exact counting algorithm numerically unstable and practically unlearnable. We empirically validate this bottleneck by training transformers from scratch, demonstrating a strict performance drop at the theoretical threshold and catastrophic out of distribution failure when scaling the vocabulary or context length. Furthermore, we show that state-of-the-art pretrained models suffer from similar failure cases. Our work reveals a critical blind spot absent from the current literature regarding the connection among these three parameters, proving that vocabulary size fundamentally dictates the difficulty of counting tasks.