Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language
Hanif Rahman, Shafeeq ur Rehman · Mar 27, 2026
Abstract
We present the Pashto Common Voice corpus, the first large-scale, openly licensed speech resource for Pashto, a language with over 60 million native speakers that is largely absent from open speech technology. Through a community effort spanning 2022-2025, the corpus grew from 1.5 hours and 5 contributors to 147 total hours and 1,483 unique speakers across ten Mozilla Common Voice releases (CV14-CV23). Speaker participation increased approximately 108-fold between CV17 and CV18, coinciding with a VOA Pashto broadcast campaign. We describe the full methodology: interface localisation, Wikipedia-based sentence extraction with automated filtering, phonemically targeted contributions for the four most frequently dropped Pashto characters, and multi-channel community outreach. MCV23 contains 107,781 clips (60,337 validated; 82.33 validated hours) across 13 content domains. Fine-tuning Whisper Base on MCV20 yields 13.4% WER on the MCV20 test split, against the published Whisper Base zero-shot WER of 99.0% on Pashto.
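For readers unfamiliar with the metric, the WER figures in the abstract can be read against the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The following is a minimal sketch of that definition, not the authors' evaluation code. The function names `word_error_rate` and `relative_reduction` and the sample strings are illustrative assumptions, and the text normalisation used in the paper is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical toy strings (one substitution, one deletion over four words).
toy_wer = word_error_rate("a b c d", "a x c")

# Figures reported in the abstract: 99.0% zero-shot vs 13.4% fine-tuned.
reported_reduction = relative_reduction(99.0, 13.4)
```

With the reported figures, the relative reduction works out to roughly 86% of the zero-shot error. Note that corpus-level WER is normally computed by summing errors and reference words over all utterances, rather than by averaging per-utterance scores.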