Skip to content
← Back to explorer

Frequency-Ordered Tokenization for Better Text Compression

Maximilian Kalcher · Feb 26, 2026 · Citations: 0

Abstract

We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with Byte Pair Encoding (BPE), reorders the vocabulary so that frequent tokens receive small integer identifiers, and encodes the result with variable-length integers before passing it to any standard compressor. On enwik8 (100 MB Wikipedia), this yields improvements of 7.08 percentage points (pp) for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd (all including vocabulary overhead), outperforming the classical Word Replacing Transform. Gains are consistent at 1 GB scale (enwik9) and across Chinese and Arabic text. We further show that preprocessing accelerates compression for computationally expensive algorithms: the total wall-clock time including preprocessing is 3.1x faster than raw zstd-22 and 2.4x faster than raw LZMA, because the preprocessed input is substantially smaller. The method can be implemented in under 50 lines of code.

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Unknown
  • Unit of annotation: Unknown
  • Expertise required: Law, Coding

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.30
  • Flags: low_signal, possible_false_positive

Research Summary

Contribution Summary

  • We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law).
  • The method tokenizes text with Byte Pair Encoding (BPE), reorders the vocabulary so that frequent tokens receive small integer identifiers, and encodes the result with variable-length integers before passing it to any standard compressor.
  • On enwik8 (100 MB Wikipedia), this yields improvements of 7.08 percentage points (pp) for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd (all including vocabulary overhead), outperforming the classical Word Replacing Transform.

Related Papers