Speech Recognition
Speech Recognition is the task of converting spoken language into text. It involves recognizing the words spoken in an audio recording and transcribing them into written form. The goal is to transcribe speech accurately, either in real time or from recorded audio, while remaining robust to accents, speaking speed, and background noise.
Datasets: LibriSpeech test-clean, LibriSpeech test-other, Switchboard + Hub500, TIMIT, AISHELL-1, WSJ eval92, Common Voice German, swb_hub_500 WER fullSWBCH, TUDA, Common Voice French, Common Voice Spanish, MediaSpeech
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | HMM-TDNN + pNorm + speed up/down speech | Percentage error | 19.3 | — | Unverified |
| 2 | DNN + Dropout | Percentage error | 19.1 | — | Unverified |
| 3 | HMM-DNN + sMBR | Percentage error | 18.4 | — | Unverified |
| 4 | HMM-TDNN + iVectors | Percentage error | 17.1 | — | Unverified |
| 5 | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB | Percentage error | 16 | — | Unverified |
| 6 | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + Fisher (10% / 15.1% respectively trained on SWBD only) | Percentage error | 13.3 | — | Unverified |
| 7 | HMM-BLSTM trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + Fisher | Percentage error | 13 | — | Unverified |
| 8 | RNN + VGG + LSTM acoustic model trained on SWB+Fisher+CH, N-gram + "model M" + NNLM language model | Percentage error | 12.2 | — | Unverified |
| 9 | VGG/Resnet/LACE/BiLSTM acoustic model trained on SWB+Fisher+CH, N-gram + RNNLM language model trained on Switchboard+Fisher+Gigaword+Broadcast | Percentage error | 11.9 | — | Unverified |
| 10 | ResNet + BiLSTMs acoustic model | Percentage error | 10.3 | — | Unverified |
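
The table does not define "Percentage error", but the note in row 5 suggests it is word error rate (WER). On that assumption, the sketch below shows how such a figure is commonly computed: the word-level edit distance between a reference transcript and a system hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative choices, not taken from any of the listed systems. Real evaluations usually apply text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER: {word_error_rate(ref, hyp):.1f}%")
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.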