SOTAVerified

Visual Speech Recognition

Papers

Showing 1-25 of 182 papers

Title | Status | Hype
mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition | Code | 3
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation | Code | 3
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing | Code | 3
CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization | Code | 2
Large Language Models are Strong Audio-Visual Speech Recognition Learners | Code | 2
SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | Code | 2
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | Code | 2
MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation | Code | 2
Visual Speech Recognition for Multiple Languages in the Wild | Code | 2
Robust Self-Supervised Audio-Visual Speech Recognition | Code | 2
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens | Code | 1
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations | Code | 1
Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | Code | 1
Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation | Code | 1
AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition | Code | 1
Tailored Design of Audio-Visual Speech Recognition Models using Branchformers | Code | 1
Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition | Code | 1
Watch Your Mouth: Silent Speech Recognition with Depth Sensing | Code | 1
It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition | Code | 1
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation | Code | 1
The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023 | Code | 1
Do VSR Models Generalize Beyond LRS3? | Code | 1
Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper | Code | 1
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder | Code | 1
MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | VTP with more data | Word Error Rate (WER) | 30.7 | - | Unverified
2 | CTC/Attention | Word Error Rate (WER) | 19.1 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VTP with more data | Word Error Rate (WER) | 22.6 | - | Unverified