Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation

2025-05-27

Dancheng Liu, Amir Nassereldine, Chenhui Xu, JinJun Xiong

Abstract

Whisper's robust performance in automatic speech recognition (ASR) is often attributed to its massive 680k-hour training set, a scale impractical for most researchers. In this work, we examine how linguistic and acoustic diversity in training data affect the robustness of ASR models and find that transcription generalization is driven primarily by acoustic variation rather than linguistic richness. Targeted acoustic augmentation can significantly improve the generalization ability of ASR models, reducing word-error rates by up to 19.24 percent on unseen datasets when training on the 960-hour LibriSpeech dataset. These findings highlight strategic, acoustically focused data augmentation as a promising alternative to massive datasets for building robust ASR models, offering a path toward future foundation ASR models when massive human speech data is lacking.
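The paper's specific augmentation pipeline is not detailed in the abstract; below is a minimal sketch of common acoustic augmentations such a pipeline might include (additive noise at a target SNR, speed perturbation, and gain scaling), using only NumPy. The function names and parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def add_noise(wave: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Add white Gaussian noise at a target signal-to-noise ratio (dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise


def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Change playback speed (and pitch) by linear-interpolation resampling."""
    n_out = int(round(len(wave) / factor))
    old_idx = np.linspace(0, len(wave) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(wave)), wave)


def apply_gain(wave: np.ndarray, gain_db: float) -> np.ndarray:
    """Scale amplitude by a gain expressed in decibels."""
    return wave * (10 ** (gain_db / 20))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 1 second of a 440 Hz tone at 16 kHz, standing in for a speech waveform.
    clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
    noisy = add_noise(clean, snr_db=10, rng=rng)   # same length, added noise
    fast = speed_perturb(clean, factor=1.1)        # ~10% shorter
    quiet = apply_gain(clean, gain_db=-6)          # roughly half amplitude
```

In training, such transforms are typically applied on the fly with randomized parameters (e.g., SNR drawn from a range, speed factors around 0.9-1.1) so the model sees a different acoustic rendering of each utterance every epoch.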
