Speech Recognition
Speech recognition is the task of converting spoken language into text: recognizing the words in an audio signal and transcribing them in written form. The goal is accurate transcription, either in real time or from recorded audio, despite variation in accents, speaking speed, and background noise.
Datasets: LibriSpeech test-clean, LibriSpeech test-other, Switchboard + Hub500, TIMIT, AISHELL-1, WSJ eval92, Common Voice German, swb_hub_500 WER fullSWBCH, TUDA, Common Voice French, Common Voice Spanish, MediaSpeech
Benchmark Results
| # | Model | Metric | Claimed (%) | Verified (%) | Status |
|---|---|---|---|---|---|
| 1 | wav2vec 2.0 XLS-R (no LM) | Test WER | 12.06 | — | Unverified |
| 2 | wav2vec 2.0 XLS-R 1B + TEVR (no LM) | Test WER | 10.1 | — | Unverified |
| 3 | VoxPopuli (n-gram) | Test WER | 7.8 | — | Unverified |
| 4 | QuartzNet15x5DE (CV-only, 5-gram) | Test WER | 7.7 | — | Unverified |
| 5 | ConformerCTC-L (no LM) | Test WER | 7.33 | — | Unverified |
| 6 | ConformerCTC-L (no LM) | Test WER | 6.68 | — | Unverified |
| 7 | QuartzNet15x5DE (D37, 5-gram) | Test WER | 6.6 | — | Unverified |
| 8 | Whisper (Large v2) | Test WER | 6.4 | — | Unverified |
| 9 | Conformer Transducer (no LM) | Test WER | 6.28 | — | Unverified |
| 10 | ConformerCTC-L (4-gram) | Test WER | 6.03 | — | Unverified |
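
The table reports word error rate (WER) without defining it. As a minimal sketch, assuming the standard definition (substitutions + deletions + insertions, divided by the number of reference words), WER can be computed with a word-level edit distance. The function name `word_error_rate` is hypothetical, and the sketch splits on whitespace only. Published benchmark scores usually also depend on text normalization (casing, punctuation) that is not shown here, so this will not necessarily reproduce the figures above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length.

    Assumption: words are separated by whitespace and compared exactly;
    no casing or punctuation normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on a mat"
    # Multiply by 100 to express the score as a percentage.
    print(f"WER: {100 * word_error_rate(ref, hyp):.2f}%")
```

Because insertions count as errors, WER can exceed 1.0 (100%) when the hypothesis is much longer than the reference. Lower is better.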