SOTAVerified

Visual Speech Recognition

Papers

Showing 1–50 of 182 papers

| Title | Status | Hype |
|---|---|---|
| mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition | Code | 3 |
| Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation | Code | 3 |
| Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing | Code | 3 |
| CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization | Code | 2 |
| Large Language Models are Strong Audio-Visual Speech Recognition Learners | Code | 2 |
| SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | Code | 2 |
| Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | Code | 2 |
| MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation | Code | 2 |
| Visual Speech Recognition for Multiple Languages in the Wild | Code | 2 |
| Robust Self-Supervised Audio-Visual Speech Recognition | Code | 2 |
| MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens | Code | 1 |
| Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations | Code | 1 |
| Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | Code | 1 |
| Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation | Code | 1 |
| AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition | Code | 1 |
| Tailored Design of Audio-Visual Speech Recognition Models using Branchformers | Code | 1 |
| Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition | Code | 1 |
| Watch Your Mouth: Silent Speech Recognition with Depth Sensing | Code | 1 |
| It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition | Code | 1 |
| Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation | Code | 1 |
| The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023 | Code | 1 |
| Do VSR Models Generalize Beyond LRS3? | Code | 1 |
| Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper | Code | 1 |
| Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder | Code | 1 |
| Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition | Code | 1 |
| MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition | Code | 1 |
| OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment | Code | 1 |
| MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information | Code | 1 |
| Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization | Code | 1 |
| Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition | Code | 1 |
| Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring | Code | 1 |
| MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition | Code | 1 |
| OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset | Code | 1 |
| Jointly Learning Visual and Auditory Speech Representations from Raw Data | Code | 1 |
| Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition | Code | 1 |
| Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition | Code | 1 |
| CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition | Code | 1 |
| End-to-end Audio-visual Speech Recognition with Conformers | Code | 1 |
| Lips Don't Lie: A Generalisable and Robust Approach to Face Forgery Detection | Code | 1 |
| AV Taris: Online Audio-Visual Speech Recognition | Code | 1 |
| Learn an Effective Lip Reading Model without Pains | Code | 1 |
| Should we hard-code the recurrence concept or learn it instead? Exploring the Transformer architecture for Audio-Visual Speech Recognition | Code | 1 |
| How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition | Code | 1 |
| Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition | Code | 1 |
| Deep Audio-Visual Speech Recognition | Code | 1 |
| Zero-shot keyword spotting for visual speech recognition in-the-wild | Code | 1 |
| VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis | — | 0 |
| ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition | — | 0 |
| Cocktail-Party Audio-Visual Speech Recognition | — | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VTP with more data | Word Error Rate (WER) | 30.7 | — | Unverified |
| 2 | CTC/Attention | Word Error Rate (WER) | 19.1 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VTP with more data | Word Error Rate (WER) | 22.6 | — | Unverified |