SOTAVerified

Visual Speech Recognition

Papers

Showing 150 of 182 papers

| Title | Status | Hype |
|-------|--------|------|
| mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition | Code | 3 |
| Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing | Code | 3 |
| Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation | Code | 3 |
| MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation | Code | 2 |
| Visual Speech Recognition for Multiple Languages in the Wild | Code | 2 |
| Large Language Models are Strong Audio-Visual Speech Recognition Learners | Code | 2 |
| SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | Code | 2 |
| Robust Self-Supervised Audio-Visual Speech Recognition | Code | 2 |
| CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization | Code | 2 |
| Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | Code | 2 |
| The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023 | Code | 1 |
| Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring | Code | 1 |
| Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition | Code | 1 |
| Zero-shot keyword spotting for visual speech recognition in-the-wild | Code | 1 |
| Tailored Design of Audio-Visual Speech Recognition Models using Branchformers | Code | 1 |
| Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper | Code | 1 |
| Watch Your Mouth: Silent Speech Recognition with Depth Sensing | Code | 1 |
| OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment | Code | 1 |
| MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens | Code | 1 |
| Lips Don't Lie: A Generalisable and Robust Approach to Face Forgery Detection | Code | 1 |
| Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation | Code | 1 |
| OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset | Code | 1 |
| Should we hard-code the recurrence concept or learn it instead? Exploring the Transformer architecture for Audio-Visual Speech Recognition | Code | 1 |
| Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition | Code | 1 |
| MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition | Code | 1 |
| AV Taris: Online Audio-Visual Speech Recognition | Code | 1 |
| Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization | Code | 1 |
| Learn an Effective Lip Reading Model without Pains | Code | 1 |
| Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations | Code | 1 |
| CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition | Code | 1 |
| It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition | Code | 1 |
| Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition | Code | 1 |
| End-to-end Audio-visual Speech Recognition with Conformers | Code | 1 |
| AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition | Code | 1 |
| Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition | Code | 1 |
| Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder | Code | 1 |
| Jointly Learning Visual and Auditory Speech Representations from Raw Data | Code | 1 |
| Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | Code | 1 |
| MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information | Code | 1 |
| Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition | Code | 1 |
| MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition | Code | 1 |
| Do VSR Models Generalize Beyond LRS3? | Code | 1 |
| Deep Audio-Visual Speech Recognition | Code | 1 |
| Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation | Code | 1 |
| How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition | Code | 1 |
| Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition | Code | 1 |
| Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides | | 0 |
| Audio-visual Recognition of Overlapped speech for the LRS2 dataset | | 0 |
| Building a synchronous corpus of acoustic and 3D facial marker data for adaptive audio-visual speech synthesis | | 0 |
Page 1 of 4

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | VTP with more data | Word Error Rate (WER) | 30.7 | | Unverified |
| 2 | CTC/Attention | Word Error Rate (WER) | 19.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | VTP with more data | Word Error Rate (WER) | 22.6 | | Unverified |
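The WER figures above are percentages. As a reminder of how the metric is defined (this sketch is ours, not taken from any listed codebase): WER is the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate (%): word-level edit distance / reference length.

    Minimal illustrative implementation; production code would typically
    use a library such as jiwer instead.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word out of three reference words gives a WER of about 33.3; a claimed score of 19.1 above means roughly one word error per five reference words.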