Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization
MD. Sagor Chowdhury, Adiba Fairooz Chowdhury
Abstract
We describe our end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. For ASR we achieve a best private Word Error Rate (WE...
Results & Benchmarks
| Task | Dataset | Metric | Value |
|---|---|---|---|
| Detection | SeamlessM4T (fixed) | Private WER. | 1.1502 |
| Detection | SeamlessM4T (VAD) | Public WER | 0.7435 |
| Computer vision | Tugstugi | WER | 34.07 |
| Computer vision | Ours (ASR) | WER | 37.74 |
Hardware Requirements
- Expect multi-day setup/compute for meaningful reproduction based on current guidance.
Best Implementation
Maintained implementation evidence is not confirmed for this paper yet.
Use the Implementation Status and Reproduction Path sections below for the current action plan.
Reproduction Path
Follow this baseline workflow to decide if this paper is worth immediate prototyping.
- 1
Use the paper and benchmark evidence to scope a baseline reproduction plan.
- 2
Track assumptions and missing details in an experiment log before coding.
Additional Implementations
No additional verified repositories beyond the primary recommendation.
Hugging Face Artifacts
No direct paper-linked artifacts were found. Showing strongest curated related artifacts.