SOTAVerified

Visual Speech Recognition

Papers

Showing 2650 of 182 papers

TitleStatusHype
MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech RecognitionCode1
OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality AlignmentCode1
MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth InformationCode1
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task GeneralizationCode1
Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech RecognitionCode1
Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability ScoringCode1
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and RecognitionCode1
OLKAVS: An Open Large-Scale Korean Audio-Visual Speech DatasetCode1
Jointly Learning Visual and Auditory Speech Representations from Raw DataCode1
Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech RecognitionCode1
CI-AVSR: A Cantonese Audio-Visual Speech Datasetfor In-car Command RecognitionCode1
Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech RecognitionCode1
CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command RecognitionCode1
End-to-end Audio-visual Speech Recognition with ConformersCode1
Lips Don't Lie: A Generalisable and Robust Approach to Face Forgery DetectionCode1
AV Taris: Online Audio-Visual Speech RecognitionCode1
Learn an Effective Lip Reading Model without PainsCode1
Should we hard-code the recurrence concept or learn it instead ? Exploring the Transformer architecture for Audio-Visual Speech RecognitionCode1
How to Teach DNNs to Pay Attention to the Visual Modality in Speech RecognitionCode1
Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech RecognitionCode1
Deep Audio-Visual Speech RecognitionCode1
Zero-shot keyword spotting for visual speech recognition in-the-wildCode1
VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis0
ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition0
Cocktail-Party Audio-Visual Speech Recognition0
Show:102550
← PrevPage 2 of 8Next →

Benchmark Results

#ModelMetricClaimedVerifiedStatus
1VTP with more dataWord Error Rate (WER)30.7Unverified
2CTC/AttentionWord Error Rate (WER)19.1Unverified
#ModelMetricClaimedVerifiedStatus
1VTP with more dataWord Error Rate (WER)22.6Unverified