SOTAVerified

Lipreading

Lipreading is the process of extracting speech by watching a speaker's lip movements in the absence of sound. Humans lipread all the time without even noticing. It is a significant part of communication, albeit not as dominant as audio, and a particularly helpful skill to learn for those who are hard of hearing.

Deep Lipreading is the process of extracting speech from a video of a silent talking face using deep neural networks. It is also known by a few other names: Visual Speech Recognition (VSR), Machine Lipreading, and Automatic Lipreading.

The primary methodology involves two stages: (i) extracting visual and temporal features from the sequence of image frames of a silent talking-face video, and (ii) decoding the feature sequence into units of speech, e.g. characters, words, or phrases. Implementations of this methodology exist both as two separately trained stages and as models trained end-to-end.

Papers

Showing 51–100 of 103 papers

- Self-supervised Transformer for Deepfake Detection
- Sign Language Translation in a Healthcare Setting
- Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading
- Sub-word Level Lip Reading With Visual Attention
- Talking Heads, Signing Avatars and Social Robots
- Target Speaker Lipreading by Audio-Visual Self-Distillation Pretraining and Speaker Adaptation
- The speaker-independent lipreading play-off; a survey of lipreading machines
- Towards Lipreading Sentences with Active Appearance Models
- Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale
- Understanding the visual speech signal
- UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation
- Visual gesture variability between talkers in continuous visual speech
- Visual Speech Enhancement
- Visual Speech Language Models
- Visual speech recognition: aligning terminologies for better understanding
- Visual Speech Recognition in a Driver Assistance System
- 3D Feature Pyramid Attention Module for Robust Visual Speech Recognition
- Word-level Persian Lipreading Dataset
- A Cascade Sequence-to-Sequence Model for Chinese Mandarin Lip Reading
- Accurate and Resource-Efficient Lipreading with Efficientnetv2 and Transformers
- Alternative Visual Units for an Optimized Phoneme-Based Lipreading System
- Analysis of Visual Features for Continuous Lipreading in Spanish
- ASR is all you need: cross-modal distillation for lip reading
- Audio-visual Multi-channel Recognition of Overlapped Speech
- Audio-visual Recognition of Overlapped speech for the LRS2 dataset
- Audio-Visual Speech Enhancement with Score-Based Generative Models
- Audio-Visual Speech Recognition With A Hybrid CTC/Attention Architecture
- Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading
- Can DNNs Learn to Lipread Full Sentences?
- Comparing heterogeneous visual gestures for measuring the diversity of visual speech signals
- Comparing phonemes and visemes with DNN-based lipreading
- Conformers are All You Need for Visual Speech Recognition
- Cross-Attention Fusion of Visual and Geometric Features for Large Vocabulary Arabic Lipreading
- Decoding visemes: improving machine lipreading
- End-to-End Multi-View Lipreading
- Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder
- ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations
- FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire
- Improving Speaker-Independent Lipreading with Domain-Adversarial Training
- Investigating the dynamics of hand and lips in French Cued Speech using attention mechanisms and CTC-based decoding
- Is Lip Region-of-Interest Sufficient for Lipreading?
- Large-Scale Visual Speech Recognition
- Large-vocabulary Audio-visual Speech Recognition in Noisy Environments
- Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition
- Learning from Videos with Deep Convolutional LSTM Networks
- Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading
- Learning Speaker-Invariant Visual Features for Lipreading
- LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers
- Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | Conv-seq2seq | Word Error Rate (WER) | 60.1 | — | Unverified |
| 2 | CTC + KD | Word Error Rate (WER) | 59.8 | — | Unverified |
| 3 | TM-seq2seq | Word Error Rate (WER) | 58.9 | — | Unverified |
| 4 | EG-seq2seq | Word Error Rate (WER) | 57.8 | — | Unverified |
| 5 | CTC-V2P | Word Error Rate (WER) | 55.1 | — | Unverified |
| 6 | Hyb + Conformer | Word Error Rate (WER) | 43.3 | — | Unverified |
| 7 | VTP | Word Error Rate (WER) | 40.6 | — | Unverified |
| 8 | ES³ Base | Word Error Rate (WER) | 40.3 | — | Unverified |
| 9 | ES³ Large | Word Error Rate (WER) | 37.1 | — | Unverified |
| 10 | RNN-T | Word Error Rate (WER) | 33.6 | — | Unverified |
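The WER figures in these tables are word-level edit distances between the hypothesis and reference transcripts, expressed as a percentage of reference words. A minimal sketch of the standard Levenshtein-based computation (the example sentences are made up):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words * 100."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)


# One substitution in a four-word reference -> 25.0
print(wer("the cat sat down", "the bat sat down"))  # -> 25.0
```

The CER tables further down use the same computation at the character level rather than the word level; lower is better for both, while the Top-1 Accuracy tables (higher is better) score word classification rather than transcription.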
| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | LIBS | Word Error Rate (WER) | 65.29 | — | Unverified |
| 2 | TM-CTC + extLM | Word Error Rate (WER) | 54.7 | — | Unverified |
| 3 | CTC + KD ASR | Word Error Rate (WER) | 53.2 | — | Unverified |
| 4 | Conv-seq2seq | Word Error Rate (WER) | 51.7 | — | Unverified |
| 5 | Hybrid CTC / Attention | Word Error Rate (WER) | 50 | — | Unverified |
| 6 | LF-MMI TDNN | Word Error Rate (WER) | 48.86 | — | Unverified |
| 7 | TM-seq2seq + extLM | Word Error Rate (WER) | 48.3 | — | Unverified |
| 8 | Multi-head Visual-Audio Memory | Word Error Rate (WER) | 44.5 | — | Unverified |
| 9 | MoCo + wav2vec (w/o extLM) | Word Error Rate (WER) | 43.2 | — | Unverified |
| 10 | CTC/Attention | Word Error Rate (WER) | 32.9 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | SyncVSR (Word Boundary) | Top-1 Accuracy | 95 | — | Unverified |
| 2 | 3D Conv + ResNet-18 + DC-TCN + KD (Ensemble & Word Boundary) | Top-1 Accuracy | 94.1 | — | Unverified |
| 3 | SyncVSR | Top-1 Accuracy | 93.2 | — | Unverified |
| 4 | AVCRFormer | Top-1 Accuracy | 89.57 | — | Unverified |
| 5 | 3D Conv + EfficientNetV2 + Transformer + TCN | Top-1 Accuracy | 89.52 | — | Unverified |
| 6 | Vosk + MediaPipe + LS + MixUp + SA + 3DResNet-18 + BiLSTM + Cosine WR | Top-1 Accuracy | 88.7 | — | Unverified |
| 7 | 3D Conv + ResNet-18 + MS-TCN + Multi-Head Visual-Audio Memory | Top-1 Accuracy | 88.5 | — | Unverified |
| 8 | 3D Conv + ResNet-18 + MS-TCN + KD (Ensemble) | Top-1 Accuracy | 88.5 | — | Unverified |
| 9 | 3D-ResNet + Bi-GRU + MixUp + Label Smoothing + Cosine LR (Word Boundary) | Top-1 Accuracy | 88.4 | — | Unverified |
| 10 | 3D-ResNet + Bi-GRU + MixUp + Label Smoothing + Cosine LR | Top-1 Accuracy | 85.5 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | SyncVSR (Word Boundary) | Top-1 Accuracy | 58.2 | — | Unverified |
| 2 | 3D-ResNet + Bi-GRU + MixUp + Label Smooth + Cosine LR (Word Boundary) | Top-1 Accuracy | 55.7 | — | Unverified |
| 3 | 3D Conv + ResNet-18 + MS-TCN + Multi-Head Visual-Audio Memory | Top-1 Accuracy | 53.8 | — | Unverified |
| 4 | 3D Conv + ResNet-18 + Bi-GRU + Visual-Audio Memory | Top-1 Accuracy | 50.82 | — | Unverified |
| 5 | 3D-ResNet + Bi-GRU + MixUp + Label Smooth + Cosine LR | Top-1 Accuracy | 48.3 | — | Unverified |
| 6 | 3D Conv + ResNet-18 + Bi-GRU (Face Cutout) | Top-1 Accuracy | 45.24 | — | Unverified |
| 7 | DFTN | Top-1 Accuracy | 41.93 | — | Unverified |
| 8 | GLMIM | Top-1 Accuracy | 38.79 | — | Unverified |
| 9 | PCPG | Top-1 Accuracy | 38.7 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | WAS | CER | 38.93 | — | Unverified |
| 2 | LipCH-Net | CER | 34.07 | — | Unverified |
| 3 | CSSMCM | CER | 32.48 | — | Unverified |
| 4 | LIBS | CER | 31.27 | — | Unverified |
| 5 | CTC/Attention | CER | 9.1 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | LipNet | Word Error Rate (WER) | 4.6 | — | Unverified |
| 2 | WAS | Word Error Rate (WER) | 3 | — | Unverified |
| 3 | LCANet | Word Error Rate (WER) | 2.9 | — | Unverified |
| 4 | LipNet (with Face Cutout) | Word Error Rate (WER) | 2.9 | — | Unverified |
| 5 | CTC/Attention | Word Error Rate (WER) | 1.2 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | 3D Conv + ResNet-18 + MS-TCN | Top-1 Accuracy | 41.4 | — | Unverified |
| 2 | 3D Conv + ResNet-34 + Bi-GRU | Top-1 Accuracy | 38.19 | — | Unverified |
| 3 | DenseNet3D + Bi-GRU | Top-1 Accuracy | 34.76 | — | Unverified |
| 4 | Multi-Tower LSTM-5 | Top-1 Accuracy | 25.76 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | ES³ Base* | Word Error Rate (WER) | 55.6 | — | Unverified |