ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling

Nicol Visser, Simon Malan, Danel Slabbert, Herman Kamper · Feb 17, 2026 · Citations: 0

Abstract

Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.
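The pipeline described in the abstract (frame-level L2 norms, segmentation, mean-pooling, discretization) can be illustrated with a small dependency-free sketch. Several pieces here are assumptions rather than details from the paper: the abstract does not say how boundaries are read off the norm curve, so the sketch places a boundary at each local minimum of the norms; the frames are hand-made toy vectors standing in for WavLM intermediate-layer features; and the centroids are hand-picked stand-ins for centroids that K-means would learn from data. All function names are hypothetical.

```python
import math

def frame_norms(frames):
    """L2 norm of each frame's feature vector."""
    return [math.sqrt(sum(x * x for x in f)) for f in frames]

def boundaries_from_norms(norms):
    """ASSUMED heuristic (not from the paper): put a boundary at each
    local minimum of the norm curve, plus the start and end of the utterance."""
    inner = [i for i in range(1, len(norms) - 1)
             if norms[i] < norms[i - 1] and norms[i] <= norms[i + 1]]
    return [0] + inner + [len(norms)]

def mean_pool(frames, bounds):
    """Average the frames inside each [start, end) segment."""
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        chunk = frames[start:end]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled

def assign_tokens(embeddings, centroids):
    """Nearest-centroid lookup, i.e. the assignment step of K-means."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda k: sqdist(e, centroids[k]))
            for e in embeddings]

# Toy 2-dimensional "features" (a real model would yield far larger vectors).
frames = [[1, 0], [3, 0], [2, 0], [0, 0.5], [0, 2], [0, 3], [0, 1]]
# Hypothetical centroids; in practice these come from K-means training.
centroids = [[2, 0], [0, 2]]

norms = frame_norms(frames)            # [1.0, 3.0, 2.0, 0.5, 2.0, 3.0, 1.0]
bounds = boundaries_from_norms(norms)  # [0, 3, 7]
segments = mean_pool(frames, bounds)   # [[2.0, 0.0], [0.0, 1.625]]
tokens = assign_tokens(segments, centroids)
print(bounds, tokens)                  # [0, 3, 7] [0, 1]
```

In this toy run seven frames collapse into two tokens, which is the sequence-length reduction that motivates syllable-like units. A real setup would replace the toy frames with features from a frozen encoder and fit the centroids on pooled segment embeddings.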

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Unknown
  • Unit of annotation: Unknown
  • Expertise required: Coding

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.30
  • Flags: low_signal, possible_false_positive

Research Summary

Contribution Summary

  • ZeroSyl is a training-free method that extracts syllable boundaries and embeddings directly from a frozen WavLM model, avoiding the multi-stage training pipelines used by Sylber and SyllableLM.
  • Boundaries are derived from the L2 norms of features in WavLM's intermediate layers; the resulting segments are mean-pooled and discretized with K-means to produce tokens for language model training.
  • The syllable-like units address the excessively long sequences produced by discrete tokens from self-supervised speech encoders.

Why It Matters For Eval

  • ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks.
