Speech Recognition
Speech Recognition is the task of converting spoken language into text: identifying the words spoken in an audio signal and transcribing them in written form. The goal is accurate transcription, whether in real time or from recorded audio, despite variation in accent, speaking rate, and background noise.
Datasets: LibriSpeech test-clean, LibriSpeech test-other, Switchboard + Hub500, TIMIT, AISHELL-1, WSJ eval92, Common Voice German, swb_hub_500 WER fullSWBCH, TUDA, Common Voice French, Common Voice Spanish, MediaSpeech
Benchmark Results
LibriSpeech test-clean

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | AmNet | Word Error Rate (WER) | 8.6 | — | Unverified |
| 2 | HMM-(SAT)GMM | Word Error Rate (WER) | 8 | — | Unverified |
| 3 | Local Prior Matching (Large Model) | Word Error Rate (WER) | 7.19 | — | Unverified |
| 4 | Snips | Word Error Rate (WER) | 6.4 | — | Unverified |
| 5 | Li-GRU | Word Error Rate (WER) | 6.2 | — | Unverified |
| 6 | HMM-DNN + pNorm* | Word Error Rate (WER) | 5.5 | — | Unverified |
| 7 | CTC + policy learning | Word Error Rate (WER) | 5.42 | — | Unverified |
| 8 | Deep Speech 2 | Word Error Rate (WER) | 5.33 | — | Unverified |
| 9 | HMM-TDNN + iVectors | Word Error Rate (WER) | 4.8 | — | Unverified |
| 10 | Gated ConvNets | Word Error Rate (WER) | 4.8 | — | Unverified |

LibriSpeech test-other

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Local Prior Matching (Large Model) | Word Error Rate (WER) | 20.84 | — | Unverified |
| 2 | Snips | Word Error Rate (WER) | 16.5 | — | Unverified |
| 3 | Local Prior Matching (Large Model, ConvLM LM) | Word Error Rate (WER) | 15.28 | — | Unverified |
| 4 | Deep Speech 2 | Word Error Rate (WER) | 13.25 | — | Unverified |
| 5 | TDNN + pNorm + speed up/down speech | Word Error Rate (WER) | 12.5 | — | Unverified |
| 6 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 10.65 | — | Unverified |
| 7 | Convolutional Speech Recognition | Word Error Rate (WER) | 10.47 | — | Unverified |
| 8 | MT4SSL | Word Error Rate (WER) | 9.6 | — | Unverified |
| 9 | Jasper DR 10x5 | Word Error Rate (WER) | 8.79 | — | Unverified |
| 10 | Espresso | Word Error Rate (WER) | 8.7 | — | Unverified |

Switchboard + Hub500

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Deep Speech | Percentage error | 20 | — | Unverified |
| 2 | DNN-HMM | Percentage error | 18.5 | — | Unverified |
| 3 | CD-DNN | Percentage error | 16.1 | — | Unverified |
| 4 | DNN | Percentage error | 16 | — | Unverified |
| 5 | DNN + Dropout | Percentage error | 15 | — | Unverified |
| 6 | DNN BMMI | Percentage error | 12.9 | — | Unverified |
| 7 | DNN MPE | Percentage error | 12.9 | — | Unverified |
| 8 | DNN MMI | Percentage error | 12.9 | — | Unverified |
| 9 | HMM-TDNN + pNorm + speed up/down speech | Percentage error | 12.9 | — | Unverified |
| 10 | HMM-DNN +sMBR | Percentage error | 12.6 | — | Unverified |

TIMIT

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | LSNN | Percentage error | 33.2 | — | Unverified |
| 2 | LAS multitask with indicators sampling | Percentage error | 20.4 | — | Unverified |
| 3 | Soft Monotonic Attention (ours, offline) | Percentage error | 20.1 | — | Unverified |
| 4 | QCNN-10L-256FM | Percentage error | 19.64 | — | Unverified |
| 5 | Bi-LSTM + skip connections w/ CTC | Percentage error | 17.7 | — | Unverified |
| 6 | Bi-RNN + Attention | Percentage error | 17.6 | — | Unverified |
| 7 | RNN-CRF on 24(x3) MFSC | Percentage error | 17.3 | — | Unverified |
| 8 | CNN in time and frequency + dropout, 17.6% w/o dropout | Percentage error | 16.7 | — | Unverified |
| 9 | Light Gated Recurrent Units | Percentage error | 16.7 | — | Unverified |
| 10 | GRU | Percentage error | 16.6 | — | Unverified |

AISHELL-1

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Att | Word Error Rate (WER) | 18.7 | — | Unverified |
| 2 | CTC/Att | Word Error Rate (WER) | 6.7 | — | Unverified |
| 3 | BRA-E | Word Error Rate (WER) | 6.63 | — | Unverified |
| 4 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 6.34 | — | Unverified |
| 5 | BAT | Word Error Rate (WER) | 4.97 | — | Unverified |
| 6 | Paraformer | Word Error Rate (WER) | 4.95 | — | Unverified |
| 7 | U2 | Word Error Rate (WER) | 4.72 | — | Unverified |
| 8 | UMA | Word Error Rate (WER) | 4.7 | — | Unverified |
| 9 | Lightweight Transducer | Word Error Rate (WER) | 4.31 | — | Unverified |
| 10 | CIF-HKD With LM | Word Error Rate (WER) | 4.1 | — | Unverified |

WSJ eval92

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Jasper 10x3 | Word Error Rate (WER) | 6.9 | — | Unverified |
| 2 | CNN over RAW speech (wav) | Word Error Rate (WER) | 5.6 | — | Unverified |
| 3 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 3.79 | — | Unverified |
| 4 | Deep Speech 2 | Word Error Rate (WER) | 3.6 | — | Unverified |
| 5 | test-set on open vocabulary (i.e. harder), model = HMM-DNN + pNorm* | Word Error Rate (WER) | 3.6 | — | Unverified |
| 6 | Convolutional Speech Recognition | Word Error Rate (WER) | 3.5 | — | Unverified |
| 7 | TC-DNN-BLSTM-DNN | Word Error Rate (WER) | 3.5 | — | Unverified |
| 8 | Espresso | Word Error Rate (WER) | 3.4 | — | Unverified |
| 9 | CTC-CRF VGG-BLSTM | Word Error Rate (WER) | 3.2 | — | Unverified |
| 10 | Transformer with Relaxed Attention | Word Error Rate (WER) | 3.19 | — | Unverified |
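
Most of the tables above report Word Error Rate (WER). As a rough illustration of what that number measures, the sketch below computes WER as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words and expressed as a percentage. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; the listed papers may normalize text and aggregate over utterances differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent for one reference/hypothesis pair.

    Illustrative sketch only: words are split on whitespace and no
    text normalization (casing, punctuation) is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER: {word_error_rate(ref, hyp):.2f}")
```

In the usage example, the hypothesis has one substitution ("sat" to "sit") and one deletion (the second "the") against six reference words, so the result is 2/6, about 33.3. Because insertions count as errors, WER can exceed 100. Over a whole test set, WER is usually obtained by summing edit counts and reference lengths across utterances before dividing, rather than by averaging per-utterance scores.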