Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR
Thibault Bañeras-Roux, Sergio Burdisso, Esaú Villatoro-Tello, Dairazalia Sánchez-Cortés, Shiran Liu, Severin Baroudi, Shashi Kumar, Hasindri Watawana, Manjunath K E, Kadri Hacioglu, Petr Motlicek, Andreas Stolcke · Apr 7, 2026 · Citations: 0
Abstract
Conventional end-to-end automatic speech recognition (ASR) systems rely on paired speech-text data for domain adaptation. Recent LLM-based ASR architectures connect a speech encoder to a large language model via a projection module, enabling adaptation with text-only data. However, this introduces a modality gap, as the LLM is not exposed to the noisy representations produced by the speech projector. We investigate whether small amounts of speech can mitigate this mismatch. We compare three strategies: text-only adaptation, paired speech-text adaptation, and mixed batching (MB), which combines both. Experiments in in-domain and out-of-domain settings show that even limited speech consistently improves performance. Notably, MB using only 10% of the target-domain speech (less than 4 hours) achieves word error rates comparable to, or better than, conventional ASR fine-tuning with the full dataset, indicating that small amounts of speech provide a strong modality-alignment signal.
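To make the mixed-batching idea concrete, below is a minimal sketch of how adaptation batches could alternate between paired speech-text data (routed through a speech projector into the LLM embedding space) and text-only data (routed through the LLM's own token embeddings). All names here (SpeechProjector, DummyLLM, speech_fraction) and the batch-mixing scheme are illustrative assumptions, not the paper's actual implementation; in particular, the abstract's 10% figure refers to the amount of target-domain speech, and using it as a per-step sampling ratio is only one plausible realization.

```python
# Hypothetical sketch of mixed batching (MB) for LLM-based ASR adaptation.
import random
import torch
import torch.nn as nn

class SpeechProjector(nn.Module):
    """Maps speech-encoder features into the LLM embedding space (assumed module)."""
    def __init__(self, speech_dim=512, llm_dim=1024):
        super().__init__()
        self.proj = nn.Linear(speech_dim, llm_dim)
    def forward(self, x):
        return self.proj(x)

class DummyLLM(nn.Module):
    """Stand-in decoder: accepts continuous embeddings, predicts token logits."""
    def __init__(self, vocab=1000, dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)
    def forward(self, inputs_embeds):
        return self.head(inputs_embeds)

projector, llm = SpeechProjector(), DummyLLM()
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.AdamW(list(projector.parameters()) + list(llm.parameters()), lr=1e-4)

def adaptation_step(batch):
    """Speech batches pass through the projector; text-only batches use the
    LLM's token embeddings, so the decoder sees both kinds of input."""
    if batch["mode"] == "speech":
        inputs = projector(batch["speech_feats"])   # (B, T, llm_dim)
    else:
        inputs = llm.embed(batch["token_ids"])      # (B, T, llm_dim)
    logits = llm(inputs)
    return loss_fn(logits.flatten(0, 1), batch["targets"].flatten())

# Draw a fraction of steps from paired speech-text data, the rest from text only.
speech_fraction = 0.1  # assumed sampling ratio for illustration
for step in range(100):
    if random.random() < speech_fraction:
        batch = {"mode": "speech",
                 "speech_feats": torch.randn(2, 50, 512),        # toy encoder features
                 "targets": torch.randint(0, 1000, (2, 50))}
    else:
        batch = {"mode": "text",
                 "token_ids": torch.randint(0, 1000, (2, 50)),   # toy text prompts
                 "targets": torch.randint(0, 1000, (2, 50))}
    opt.zero_grad()
    loss = adaptation_step(batch)
    loss.backward()
    opt.step()
```

The intent of the mixture is that text-only batches supply cheap domain knowledge while occasional speech batches keep the LLM exposed to the projector's noisy representations, which is the modality-alignment signal the abstract describes.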