Phonology-Guided Speech-to-Speech Translation for African Languages
Peter Ochieng, Dennis Kaburu
Abstract
We present a prosody-guided framework for speech-to-speech translation (S2ST) that aligns and translates speech without transcripts by leveraging cross-linguistic pause synchrony. Analyzing a 6,000-hour East African news corpus spanning five languages, we show that within-phylum language pairs exhibit 30-40% lower pause variance and over 3× higher onset/offset correlation than cross-phylum pairs. These findings motivate SPaDA, a dynamic-programming alignment algorithm that integrates silence consistency, rate synchrony, and semantic similarity. SPaDA improves alignment F1 by 3-4 points and eliminates up to 38% of spurious matches relative to greedy VAD baselines. Using SPaDA-aligned segments, we train SegUniDiff, a diffusion-based S2ST model guided by external gradients from frozen semantic and speaker encoders. SegUniDiff matches an enhanced cascade in BLEU (30.3 on CVSS-C vs. 28.9 for UnitY), reduces the speaker equal error rate (EER) from 12.5% to 5.3%, and runs at a real-time factor (RTF) of 1.02. To support evaluation in low-resource settings, we also release a three-tier, transcript-free BLEU suite (M1-M3) that correlates strongly with human judgments. Together, our results show that prosodic cues in multilingual speech provide a reliable scaffold for scalable, non-autoregressive S2ST.
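To make the alignment idea concrete, the following is a minimal sketch of a monotonic dynamic-programming aligner that scores segment pairs by pause similarity, speaking-rate similarity, and embedding cosine similarity. It is not the authors' SPaDA implementation: the `Segment` fields, the scoring function, the equal weights, and the skip penalty are all illustrative assumptions, and the abstract does not specify how the three terms are actually combined.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Segment:
    pause: float  # silence preceding the segment, in seconds (assumed feature)
    rate: float   # speaking rate, e.g. syllables per second (assumed feature)
    emb: tuple    # semantic embedding from any frozen encoder (hypothetical)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def pair_score(s, t, w_pause=1.0, w_rate=1.0, w_sem=1.0):
    """Illustrative score: weighted sum of three similarity terms."""
    pause_sim = 1.0 / (1.0 + abs(s.pause - t.pause))
    rate_sim = 1.0 / (1.0 + abs(s.rate - t.rate))
    return w_pause * pause_sim + w_rate * rate_sim + w_sem * cosine(s.emb, t.emb)


def align(src, tgt, skip_penalty=1.0):
    """Monotonic alignment: each step matches a pair or skips one segment."""
    n, m = len(src), len(tgt)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], back[i][0] = best[i - 1][0] - skip_penalty, "skip_src"
    for j in range(1, m + 1):
        best[0][j], back[0][j] = best[0][j - 1] - skip_penalty, "skip_tgt"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (best[i - 1][j - 1] + pair_score(src[i - 1], tgt[j - 1]), "match"),
                (best[i - 1][j] - skip_penalty, "skip_src"),
                (best[i][j - 1] - skip_penalty, "skip_tgt"),
            ]
            best[i][j], back[i][j] = max(options)
    # Trace back from the end to recover the matched index pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

Unlike a greedy matcher, which commits to the best local pair at each step, the table above chooses the globally best monotonic path, so a source segment with no plausible counterpart can be skipped at a fixed cost instead of forcing a spurious match.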