Lipreading

Lipreading is the process of extracting speech by watching a speaker's lip movements in the absence of sound. Humans lipread all the time without noticing. It plays a significant part in communication, albeit a less dominant one than audio. It is an especially helpful skill for those who are hard of hearing.

Deep Lipreading is the process of extracting speech from a video of a silent talking face using deep neural networks. It is also known by a few other names: Visual Speech Recognition (VSR), Machine Lipreading, Automatic Lipreading, etc.

The primary methodology involves two stages: i) extracting visual and temporal features from a sequence of image frames of a silent talking video, and ii) processing that sequence of features into units of speech, e.g. characters, words, or phrases. Implementations of this methodology either train the two stages separately or train them end-to-end in one go.
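The two stages can be sketched in plain Python. This is a toy illustration only, not a deep network and not any listed paper's method: the per-frame features (mean intensity and its change from the previous frame), the thresholds in `classify_frame`, the `BLANK` symbol, and all function names are hypothetical. The second stage uses a greedy CTC-style collapse (merge repeated labels, drop blanks) as one assumed way to turn per-frame predictions into characters.

```python
from typing import List, Optional, Sequence

Frame = List[List[float]]  # one grayscale mouth crop: rows of pixel intensities in [0, 1]
BLANK = "_"                # hypothetical "no character" label


def extract_features(frames: Sequence[Frame]) -> List[List[float]]:
    """Stage i: produce one feature vector per frame.

    Toy stand-in for a learned visual front-end. Each vector is
    [mean intensity (visual), change from previous frame (temporal)].
    """
    features: List[List[float]] = []
    prev_mean: Optional[float] = None
    for frame in frames:
        pixels = [p for row in frame for p in row]
        mean = sum(pixels) / len(pixels)
        delta = 0.0 if prev_mean is None else mean - prev_mean
        features.append([mean, delta])
        prev_mean = mean
    return features


def classify_frame(feature: Sequence[float]) -> str:
    """Toy per-frame classifier with made-up thresholds on mean intensity.

    A real back-end would be a trained sequence model that consumes
    both the visual and the temporal component.
    """
    mean = feature[0]
    if mean < 0.25:
        return BLANK
    if mean < 0.6:
        return "a"
    return "b"


def decode(features: Sequence[Sequence[float]]) -> str:
    """Stage ii: map the feature sequence to characters.

    Greedy CTC-style collapse: merge consecutive repeats, then drop blanks.
    """
    chars: List[str] = []
    prev_label: Optional[str] = None
    for feature in features:
        label = classify_frame(feature)
        if label != prev_label and label != BLANK:
            chars.append(label)
        prev_label = label
    return "".join(chars)


def lipread(frames: Sequence[Frame]) -> str:
    """Run both stages back to back (the two-stage variant, not end-to-end)."""
    return decode(extract_features(frames))
```

In an end-to-end system the two functions would be layers of a single model trained jointly, rather than separate hand-written steps as in this sketch.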

Papers

Showing 1–10 of 103 papers

Title	Date	Tasks	Status	Hype
Learning Speaker-Invariant Visual Features for Lipreading	Jun 9, 2025	Disentanglement, Lipreading	Unverified	0
UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation	Jun 4, 2025	cross-modal alignment, Lipreading	Unverified	0
OXSeg: Multidimensional attention UNet-based lip segmentation using semi-supervised lip contours	May 8, 2025	Generative Adversarial Network, Lipreading	Unverified	0
Target Speaker Lipreading by Audio-Visual Self-Distillation Pretraining and Speaker Adaptation	Feb 9, 2025	Cross-Lingual Transfer, Lipreading	Unverified	0
Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models	Feb 9, 2025	Audio-Visual Speech Recognition, Automatic Speech Recognition	Code Available	1
Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions	Feb 1, 2025	Lipreading, speech-recognition	Code Available	0
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs	Nov 4, 2024	Lipreading, speech-recognition	Code Available	1
RAL:Redundancy-Aware Lipreading Model Based on Differential Learning with Symmetric Views	Sep 9, 2024	Lipreading, Lip Reading	Unverified	0
SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization	Jun 18, 2024	Landmark-based Lipreading, Lipreading	Code Available	2
Watch Your Mouth: Silent Speech Recognition with Depth Sensing	May 11, 2024	Deep Learning, Lipreading	Code Available	1

Datasets: LRS3-TED, LRS2, Lip Reading in the Wild, CAS-VSR-W1k (LRW-1000), CMLR, GRID corpus (mixed-speech), LRW-1000, CAS-VSR-S101

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Conv-seq2seq	Word Error Rate (WER)	60.1	—	Unverified
2	CTC + KD	Word Error Rate (WER)	59.8	—	Unverified
3	TM-seq2seq	Word Error Rate (WER)	58.9	—	Unverified
4	EG-seq2seq	Word Error Rate (WER)	57.8	—	Unverified
5	CTC-V2P	Word Error Rate (WER)	55.1	—	Unverified
6	Hyb + Conformer	Word Error Rate (WER)	43.3	—	Unverified
7	VTP	Word Error Rate (WER)	40.6	—	Unverified
8	ES³ Base	Word Error Rate (WER)	40.3	—	Unverified
9	ES³ Large	Word Error Rate (WER)	37.1	—	Unverified
10	RNN-T	Word Error Rate (WER)	33.6	—	Unverified
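For reference, Word Error Rate is conventionally computed as 100 × (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the reference transcript into the hypothesis, and N is the number of reference words. Lower is better. The sketch below is a minimal implementation using word-level edit distance. It assumes the table's values are percentages and that words are split on whitespace with no text normalization, neither of which the page states; the function name is illustrative.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage: 100 * (S + D + I) / N.

    Words are obtained by whitespace splitting; no casing or punctuation
    normalization is applied (an assumption of this sketch).
    """
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.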