SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
William Chan, Daniel Park, Chris Lee, Yu Zhang, Quoc Le, Mohammad Norouzi
Abstract
We present SpeechStew, a speech recognition model that is trained on a combination of various publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal. SpeechStew simply mixes all of these datasets together, without any special re-weighting or re-balancing of the datasets. SpeechStew achieves SoTA or near-SoTA results across a variety of tasks, without the use of an external language model. Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% WER on WSJ, which significantly outperform prior work that uses strong external language models. We also demonstrate that SpeechStew learns powerful transfer learning representations. We fine-tune SpeechStew on a noisy, low-resource speech dataset, CHiME-6. We achieve 38.9% WER without a language model, compared with 38.6% WER for a strong HMM baseline with a language model.
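The mixing step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `mix_datasets`, the corpus names, and the utterance IDs are all hypothetical placeholders. It assumes that "simply mixing" means pooling every utterance from every corpus and shuffling, so each corpus contributes in proportion to its size and no per-corpus weights are applied.

```python
import random
from typing import Dict, List, Tuple


def mix_datasets(
    datasets: Dict[str, List[str]], seed: int = 0
) -> List[Tuple[str, str]]:
    """Pool all utterances from all corpora and shuffle them.

    No re-weighting or re-balancing is applied: a corpus with more
    utterances simply appears more often in the mixed stream.
    """
    # Tag each utterance with its source corpus, then concatenate.
    mixed = [(name, utt) for name, utts in datasets.items() for utt in utts]
    # Seeded shuffle so the mixed order is reproducible.
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy placeholder corpora; the names, IDs, and sizes are invented for
# illustration and do not reflect the real datasets.
toy = {
    "corpus_a": [f"a_{i}" for i in range(6)],
    "corpus_b": [f"b_{i}" for i in range(3)],
    "corpus_c": [f"c_{i}" for i in range(1)],
}
mixed = mix_datasets(toy)
```

Under this assumption, the share of each corpus in the training stream equals its share of the pooled utterances, which is the behavior a scheme with explicit sampling weights would override.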
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| AMI IHM | SpeechStew (100M) | Word Error Rate (WER) | 9 | — | Unverified |
| AMI SDM1 | SpeechStew (100M) | Word Error Rate (WER) | 21.7 | — | Unverified |
| CHiME-6 dev_gss12 | SpeechStew (1B) | Word Error Rate (WER) | 31.9 | — | Unverified |
| CHiME-6 eval | SpeechStew (1B) | Word Error Rate (WER) | 38.9 | — | Unverified |
| Common Voice | SpeechStew (1B) | Test WER | 10.8 | — | Unverified |
| LibriSpeech test-clean | SpeechStew (1B) | Word Error Rate (WER) | 1.7 | — | Unverified |
| LibriSpeech test-clean | SpeechStew (100M) | Word Error Rate (WER) | 2 | — | Unverified |
| LibriSpeech test-other | SpeechStew (1B) | Word Error Rate (WER) | 3.3 | — | Unverified |
| LibriSpeech test-other | SpeechStew (100M) | Word Error Rate (WER) | 4 | — | Unverified |
| Switchboard CallHome | SpeechStew (100M) | Word Error Rate (WER) | 8.3 | — | Unverified |
| Switchboard SWBD | SpeechStew (100M) | Word Error Rate (WER) | 4.7 | — | Unverified |
| Tedlium | SpeechStew (100M) | Word Error Rate (WER) | 5.3 | — | Unverified |
| WSJ eval92 | SpeechStew (100M) | Word Error Rate (WER) | 1.3 | — | Unverified |
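
For readers unfamiliar with the metric in the table, the sketch below computes word error rate under its standard definition: the word-level edit distance (substitutions, deletions, and insertions) between hypothesis and reference, divided by the number of reference words. The function name `word_error_rate` is a hypothetical helper. The scoring tools and text normalization behind the claimed numbers are not stated here, so this is not a way to reproduce them exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.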