SOTAVerified

Audio-Visual Speech Recognition

Audio-visual speech recognition is the task of transcribing a paired audio and visual stream into text.

Papers

Showing 1120 of 100 papers

TitleStatusHype
How to Teach DNNs to Pay Attention to the Visual Modality in Speech RecognitionCode1
Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech RecognitionCode1
Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation ModelsCode1
Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech RecognitionCode1
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion EncoderCode1
Deep Audio-Visual Speech RecognitionCode1
Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech RecognitionCode1
Discriminative Multi-modality Speech RecognitionCode1
CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command RecognitionCode1
AV Taris: Online Audio-Visual Speech RecognitionCode1
Show:102550
← PrevPage 2 of 10Next →

Benchmark Results

#ModelMetricClaimedVerifiedStatus
1Hybrid CTC / AttentionWord Error Rate (WER)39.1Unverified
2TM-Seq2seqTest WER8.5Unverified
3TM-CTCTest WER8.2Unverified
4CTC/AttentionTest WER7Unverified
5CTC/AttentionTest WER1.5Unverified
6Whisper-FlamingoTest WER1.4Unverified
#ModelMetricClaimedVerifiedStatus
1Hyb-ConformerWord Error Rate (WER)2.3Unverified
2Zero-AVSRWord Error Rate (WER)1.5Unverified
3AV-HuBERT LargeWord Error Rate (WER)1.4Unverified
4Whisper-FlamingoWord Error Rate (WER)0.76Unverified
5MMS-LLaMAWord Error Rate (WER)0.74Unverified
#ModelMetricClaimedVerifiedStatus
1AVCRFormerTop-1 Accuracy98.81Unverified
22DCNN + BiLSTM + ResNet + MLFTop-1 Accuracy98.76Unverified
3PBLTop-1 Accuracy98.3Unverified
#ModelMetricClaimedVerifiedStatus
1ES³ Base*Word Error Rate (WER)11Unverified