Audio-Visual Speech Recognition
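Audio-visual speech recognition (AVSR) is the task of transcribing speech into text from paired audio and visual streams, typically a recording of the speaker's voice together with video of their face or lips.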
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Hyb-Conformer | Word Error Rate (WER) | 2.3 | — | Unverified |
| 2 | Zero-AVSR | Word Error Rate (WER) | 1.5 | — | Unverified |
| 3 | AV-HuBERT Large | Word Error Rate (WER) | 1.4 | — | Unverified |
| 4 | Whisper-Flamingo | Word Error Rate (WER) | 0.76 | — | Unverified |
| 5 | MMS-LLaMA | Word Error Rate (WER) | 0.74 | — | Unverified |
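Every entry above is scored by Word Error Rate, where lower is better. As a rough illustration of how the metric is usually defined, the sketch below computes WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; the table does not say how each paper normalizes text or whether the reported values are fractions or percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length.

    Hypothetical helper for illustration; it splits on whitespace and
    applies no text normalization (casing, punctuation), which real
    evaluations often do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

The function returns a fraction; multiply by 100 to express it as a percentage. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.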