Speech Recognition
Speech recognition is the task of converting spoken language into text: recognizing the words spoken in an audio recording and transcribing them into written form. The goal is an accurate transcript, produced either in real time or from recorded audio, that remains robust to factors such as accents, speaking speed, and background noise.
(Image credit: SpecAugment)
Datasets: LibriSpeech test-clean, LibriSpeech test-other, Switchboard + Hub500, TIMIT, AISHELL-1, WSJ eval92, Common Voice German, swb_hub_500 WER fullSWBCH, TUDA, Common Voice French, Common Voice Spanish, MediaSpeech
Benchmark Results
| # | Model | Metric | Claimed (%) | Verified (%) | Status |
|---|---|---|---|---|---|
| 1 | Jasper 10x3 | Word Error Rate (WER) | 6.9 | — | Unverified |
| 2 | CNN over RAW speech (wav) | Word Error Rate (WER) | 5.6 | — | Unverified |
| 3 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 3.79 | — | Unverified |
| 4 | test-set on open vocabulary (i.e. harder), model = HMM-DNN + pNorm* | Word Error Rate (WER) | 3.6 | — | Unverified |
| 5 | Deep Speech 2 | Word Error Rate (WER) | 3.6 | — | Unverified |
| 6 | Convolutional Speech Recognition | Word Error Rate (WER) | 3.5 | — | Unverified |
| 7 | TC-DNN-BLSTM-DNN | Word Error Rate (WER) | 3.5 | — | Unverified |
| 8 | Espresso | Word Error Rate (WER) | 3.4 | — | Unverified |
| 9 | CTC-CRF VGG-BLSTM | Word Error Rate (WER) | 3.2 | — | Unverified |
| 10 | Transformer with Relaxed Attention | Word Error Rate (WER) | 3.19 | — | Unverified |
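Every row above reports Word Error Rate (WER), which is commonly defined as the number of word-level substitutions, deletions, and insertions needed to turn the reference transcript into the system output, divided by the number of reference words. The sketch below illustrates that definition in Python. The function name `word_error_rate` is hypothetical, and the code assumes plain whitespace tokenization with no text normalization (casing, punctuation), so its output is not directly comparable to the scores in the table, which come from each paper's own scoring setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length.

    Assumption: words are split on whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(f"WER: {word_error_rate(ref, hyp):.2f}%")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference. Lower values are better.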