Audio-Visual Speech Recognition
Audio-visual speech recognition (AVSR) is the task of transcribing speech into text from paired audio and video streams of a speaker.
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Hybrid CTC / Attention | Word Error Rate (WER) | 39.1 | — | Unverified |
| 2 | TM-Seq2seq | Test WER | 8.5 | — | Unverified |
| 3 | TM-CTC | Test WER | 8.2 | — | Unverified |
| 4 | CTC/Attention | Test WER | 7 | — | Unverified |
| 5 | CTC/Attention | Test WER | 1.5 | — | Unverified |
| 6 | Whisper-Flamingo | Test WER | 1.4 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Hyb-Conformer | Word Error Rate (WER) | 2.3 | — | Unverified |
| 2 | Zero-AVSR | Word Error Rate (WER) | 1.5 | — | Unverified |
| 3 | AV-HuBERT Large | Word Error Rate (WER) | 1.4 | — | Unverified |
| 4 | Whisper-Flamingo | Word Error Rate (WER) | 0.76 | — | Unverified |
| 5 | MMS-LLaMA | Word Error Rate (WER) | 0.74 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | AVCRFormer | Top-1 Accuracy | 98.81 | — | Unverified |
| 2 | 2DCNN + BiLSTM + ResNet + MLF | Top-1 Accuracy | 98.76 | — | Unverified |
| 3 | PBL | Top-1 Accuracy | 98.3 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | ES³ Base* | Word Error Rate (WER) | 11 | — | Unverified |
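
Most of the tables above report word error rate (WER), so a small reference computation may help when comparing the numbers. The sketch below assumes the common definition: the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words and expressed as a percentage. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions; the listed papers may apply their own text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent, using whitespace tokenization (an assumption)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because the denominator is the reference length, WER can exceed 100 when the hypothesis contains many insertions, which is one reason it is not directly comparable to the Top-1 accuracy reported in the third table.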