SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

2019-04-18

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le


Code

github.com/iver56/audiomentations (PyTorch, ★ 2,242)
github.com/google-research/leaf-audio (TensorFlow, ★ 521)
github.com/SarthakYadav/audax (JAX, ★ 72)
github.com/AmirmohammadRostami/KeywordsSpotting-EfficientNet-A0 (PyTorch, ★ 23)
github.com/park-cheol/ASR-Conformer (PyTorch, ★ 15)
github.com/audio-westlakeu/rct (PyTorch, ★ 14)
github.com/Audio-WestlakeU/RCT-Random-Consistency-Training (PyTorch, ★ 14)
github.com/HLasse/wav2vec_finetune (PyTorch, ★ 10)
github.com/cosmoquester/speech-recognition (TensorFlow, ★ 8)
github.com/HeleneFabia/keyword-spotter (PyTorch, ★ 2)

Abstract

We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment to Listen, Attend and Spell (LAS) networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work. On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, we achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5'00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.
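The three operations in the augmentation policy (warping the features along time, masking blocks of frequency channels, masking blocks of time steps) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the piecewise-linear `np.interp` resampling is a swapped-in simplification of the warping step, masked values are set to zero on the assumption that the features are already mean-normalized, and the function names and parameters (`W`, `F`, `m_F`, `T`, `p`, `m_T`) with their default values are hypothetical choices for illustration.

```python
import numpy as np


def time_warp(spec: np.ndarray, W: int, rng: np.random.Generator) -> np.ndarray:
    """Warp a (time, freq) spectrogram along the time axis.

    Simplification (assumption): piecewise-linear resampling with np.interp
    stands in for the paper's warping step. An anchor frame w0 is moved by a
    random offset drawn from [-W, W]; the segments on either side stretch to fit.
    """
    tau, n_freq = spec.shape
    if W <= 0 or tau <= 2 * W:
        return spec  # too short to warp with this W
    w0 = int(rng.integers(W, tau - W))   # anchor frame in [W, tau - W)
    w = int(rng.integers(-W, W + 1))     # displacement in [-W, W]
    dst = min(max(w0 + w, 1), tau - 2)   # keep interpolation knots strictly increasing
    out_t = np.arange(tau, dtype=np.float64)
    # Output frame dst samples source frame w0; the first and last frames stay fixed.
    src_t = np.interp(out_t, [0.0, dst, tau - 1.0], [0.0, w0, tau - 1.0])
    frames = np.arange(tau, dtype=np.float64)
    return np.stack([np.interp(src_t, frames, spec[:, k]) for k in range(n_freq)], axis=1)


def freq_mask(spec, F, num_masks, rng):
    """Zero out num_masks blocks of consecutive frequency channels, each at most F wide."""
    spec = spec.copy()
    n_freq = spec.shape[1]
    for _ in range(num_masks):
        f = min(int(rng.integers(0, F + 1)), n_freq)  # block width in [0, F]
        f0 = int(rng.integers(0, n_freq - f + 1))      # block start
        spec[:, f0:f0 + f] = 0.0  # assumption: features are mean-normalized, so 0 is the mean
    return spec


def time_mask(spec, T, num_masks, p, rng):
    """Zero out num_masks blocks of consecutive time steps, each at most min(T, p * tau) long."""
    spec = spec.copy()
    tau = spec.shape[0]
    max_t = min(T, int(p * tau), tau)
    for _ in range(num_masks):
        t = int(rng.integers(0, max_t + 1))     # block length
        t0 = int(rng.integers(0, tau - t + 1))  # block start
        spec[t0:t0 + t, :] = 0.0
    return spec


def spec_augment(spec, rng, W=5, F=10, m_F=2, T=20, p=1.0, m_T=2):
    """Time warp, then frequency masks, then time masks (hypothetical defaults)."""
    out = time_warp(spec, W, rng)
    out = freq_mask(out, F, m_F, rng)
    return time_mask(out, T, m_T, p, rng)


rng = np.random.default_rng(0)
features = rng.standard_normal((200, 80))  # stand-in for 200 frames x 80 filter-bank channels
augmented = spec_augment(features, rng)
print(augmented.shape)  # (200, 80)
```

Because every call draws new random widths and positions, applying `spec_augment` on the fly during training gives each pass over an utterance a different corrupted view, while the transcript label stays unchanged.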

Tasks

Automatic Speech Recognition (ASR), Data Augmentation, Language Modeling, Speech Recognition

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
Hub5'00 (Switchboard)	LAS + SpecAugment (with LM, Switchboard mild policy)	WER (%), Switchboard portion	6.8	n/a	Unverified
Hub5'00 (Switchboard)	LAS + SpecAugment (with LM, Switchboard strong policy)	WER (%), Switchboard portion	7.1	n/a	Unverified
LibriSpeech test-clean	LAS + SpecAugment	WER (%)	2.5	n/a	Unverified
LibriSpeech test-clean	LAS (no LM)	WER (%)	2.7	n/a	Unverified
LibriSpeech test-other	LAS + SpecAugment	WER (%)	5.8	n/a	Unverified
LibriSpeech test-other	LAS (no LM)	WER (%)	6.5	n/a	Unverified
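Every number in the table is a word error rate (WER) in percent. As a hedged sketch of how such a figure is computed (the function names are hypothetical, and this is not the scoring tool used for these benchmarks), the word-level Levenshtein distance between each reference and hypothesis is summed over the test set and divided by the total number of reference words:

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion of a reference word
                         cur[j - 1] + 1,      # insertion of a hypothesis word
                         prev[j - 1] + cost)  # substitution or match
        prev = cur
    return prev[-1]


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """WER in percent: total word edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    if words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / words


pairs = [
    ("the cat sat on the mat", "the cat sit on mat"),  # 1 substitution + 1 deletion
    ("hello world", "hello world"),                    # exact match
]
print(f"{corpus_wer(pairs):.1f}")  # 25.0
```

Aggregating edits and words over the whole set, rather than averaging per-utterance rates, keeps long utterances from being under-weighted. Because insertions count as errors, WER can exceed 100%. Official scoring pipelines typically also normalize text before alignment, which this sketch omits.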
