SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network

2021-04-05

William Chan, Daniel Park, Chris Lee, Yu Zhang, Quoc Le, Mohammad Norouzi

Abstract

We present SpeechStew, a speech recognition model that is trained on a combination of various publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal. SpeechStew simply mixes all of these datasets together, without any special re-weighting or re-balancing of the datasets. SpeechStew achieves SoTA or near SoTA results across a variety of tasks, without the use of an external language model. Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% WER on WSJ, which significantly outperforms prior work with strong external language models. We also demonstrate that SpeechStew learns powerful transfer learning representations. We fine-tune SpeechStew on a noisy low resource speech dataset, CHiME-6. We achieve 38.9% WER without a language model, compared with 38.6% WER for a strong HMM baseline with a language model.
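
The "simply mix" idea in the abstract can be illustrated with a minimal sketch. It assumes one plausible reading: all utterances from all corpora are pooled and shuffled with no per-dataset weights, so larger corpora contribute proportionally more training examples. The `Utterance` format, the `mix_datasets` helper, and the toy corpus contents are hypothetical and are not taken from the paper.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical utterance format: (audio_path, transcript).
Utterance = Tuple[str, str]


def mix_datasets(datasets: Dict[str, List[Utterance]], seed: int = 0) -> List[Utterance]:
    """Pool every utterance from every dataset and shuffle.

    No re-weighting or re-balancing is applied, so each dataset's share
    of the pool equals its share of the total utterance count.
    """
    pool = [utt for utts in datasets.values() for utt in utts]
    random.Random(seed).shuffle(pool)
    return pool


# Toy stand-ins for real corpora (assumed sizes, for illustration only).
toy = {
    "librispeech": [(f"ls_{i}.wav", "text") for i in range(6)],
    "ami": [(f"ami_{i}.wav", "text") for i in range(3)],
    "wsj": [(f"wsj_{i}.wav", "text") for i in range(1)],
}
mixed = mix_datasets(toy)
```

In this toy pool, 6 of the 10 utterances come from the largest corpus, which is the practical consequence of mixing without re-balancing.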

Tasks

Language Modelling, Speech Recognition, Transfer Learning

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
AMI IHM	SpeechStew (100M)	Word Error Rate (WER)	9.0	—	Unverified
AMI SDM1	SpeechStew (100M)	Word Error Rate (WER)	21.7	—	Unverified
CHiME-6 dev_gss12	SpeechStew (1B)	Word Error Rate (WER)	31.9	—	Unverified
CHiME-6 eval	SpeechStew (1B)	Word Error Rate (WER)	38.9	—	Unverified
Common Voice	SpeechStew (1B)	Test WER	10.8	—	Unverified
LibriSpeech test-clean	SpeechStew (1B)	Word Error Rate (WER)	1.7	—	Unverified
LibriSpeech test-clean	SpeechStew (100M)	Word Error Rate (WER)	2	—	Unverified
LibriSpeech test-other	SpeechStew (1B)	Word Error Rate (WER)	3.3	—	Unverified
LibriSpeech test-other	SpeechStew (100M)	Word Error Rate (WER)	4	—	Unverified
Switchboard CallHome	SpeechStew (100M)	Word Error Rate (WER)	8.3	—	Unverified
Switchboard SWBD	SpeechStew (100M)	Word Error Rate (WER)	4.7	—	Unverified
Tedlium	SpeechStew (100M)	Word Error Rate (WER)	5.3	—	Unverified
WSJ eval92	SpeechStew (100M)	Word Error Rate (WER)	1.3	—	Unverified
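
Every row above reports word error rate (WER): the word-level edit distance between the reference and the hypothesis (substitutions, deletions, insertions) divided by the number of reference words. The sketch below computes it for a single utterance pair. It is an illustration only: the `word_error_rate` helper is hypothetical, splitting on whitespace is an assumption, and real benchmark scoring typically applies dataset-specific text normalization and aggregates errors over the whole test set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
example = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Multiplying the result by 100 gives the percentage form used in the table.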