SOTAVerified

Visual Speech Recognition

Papers

Showing 26-50 of 182 papers

Title | Status | Hype
AV Taris: Online Audio-Visual Speech Recognition | Code | 1
Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition | Code | 1
Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition | Code | 1
The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023 | Code | 1
CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition | Code | 1
How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition | Code | 1
Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition | Code | 1
Do VSR Models Generalize Beyond LRS3? | Code | 1
AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition | Code | 1
End-to-end Audio-visual Speech Recognition with Conformers | Code | 1
Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | Code | 1
Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition | Code | 1
Tailored Design of Audio-Visual Speech Recognition Models using Branchformers | Code | 1
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder | Code | 1
Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition | Code | 1
Learn an Effective Lip Reading Model without Pains | Code | 1
MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition | Code | 1
Deep Audio-Visual Speech Recognition | Code | 1
Lips Don't Lie: A Generalisable and Robust Approach to Face Forgery Detection | Code | 1
Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper | Code | 1
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens | Code | 1
Recurrent Neural Network Transducer for Audio-Visual Speech Recognition | Code | 0
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition | Code | 0
Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation | Code | 0
Page 2 of 8

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | VTP with more data | Word Error Rate (WER) | 30.7 | | Unverified
2 | CTC/Attention | Word Error Rate (WER) | 19.1 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VTP with more data | Word Error Rate (WER) | 22.6 | | Unverified