Deep Speech: Scaling up end-to-end speech recognition
Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng
Code
- github.com/PaddlePaddle/PaddleSpeech (official; framework: paddle; ★ 12,564)
- github.com/Picovoice/stt-benchmark (framework: none; ★ 687)
- github.com/Picovoice/speech-to-text-benchmark (framework: none; ★ 687)
- github.com/msalhab96/SpeeQ (framework: pytorch; ★ 51)
- github.com/anssssss/Vietnamese-Speech-Recognition (framework: tf; ★ 0)
- github.com/bjtommychen/Keras_DeepSpeech2_SpeechRecognition (framework: tf; ★ 0)
- github.com/GeorgeFedoseev/DeepSpeech (framework: tf; ★ 0)
- github.com/RezisEwig/unity_speech (framework: none; ★ 0)
- github.com/YuBeomGon/DeepSpeech (framework: tf; ★ 0)
- github.com/WalterJohnson0/DeepSpeech-KerasRebuild (framework: tf; ★ 0)
Abstract
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| VoxForge American-Canadian | Deep Speech | Percentage error | 15.01 | — | Unverified |
| VoxForge Commonwealth | Deep Speech | Percentage error | 28.46 | — | Unverified |
| VoxForge European | Deep Speech | Percentage error | 31.2 | — | Unverified |
| VoxForge Indian | Deep Speech | Percentage error | 45.35 | — | Unverified |
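
The table reports a "Percentage error" without defining it. Speech recognition results of this kind are usually word error rates, so the sketch below assumes that reading: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words and expressed as a percentage. The function name `word_error_rate` and the whitespace tokenization are illustrative choices, not taken from the paper or from any of the listed repositories.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate, in percent, of hypothesis against reference.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because the denominator is the reference length, a hypothesis with many inserted words can score above 100 under this definition.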