Audio-Visual Speech Recognition
Audio-visual speech recognition (AVSR) is the task of transcribing speech into text from a paired audio stream and video stream of the speaker.
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Hybrid CTC / Attention | Word Error Rate (WER) | 39.1 | — | Unverified |
| 2 | TM-Seq2seq | Test WER | 8.5 | — | Unverified |
| 3 | TM-CTC | Test WER | 8.2 | — | Unverified |
| 4 | CTC/Attention | Test WER | 7 | — | Unverified |
| 5 | CTC/Attention | Test WER | 1.5 | — | Unverified |
| 6 | Whisper-Flamingo | Test WER | 1.4 | — | Unverified |
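
Every entry above reports word error rate. The page does not say how each paper computed it, so the following is only a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, expressed here as a percentage. The function name `word_error_rate`, the whitespace tokenization, and the absence of any text normalization (casing, punctuation) are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: 100 * (S + D + I) / N, where N = reference word count.

    Assumption: words are split on whitespace and compared exactly,
    with no case folding or punctuation stripping.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because insertions count as errors, WER computed this way can exceed 100 when the hypothesis is much longer than the reference. Scores are also sensitive to normalization choices, which is one reason numbers from different papers are not always directly comparable.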