Lipreading
Lipreading is the process of understanding speech by watching a speaker's lip movements in the absence of sound. Humans lipread all the time without even noticing: it plays a meaningful part in communication, albeit not as dominant a part as audio, and it is a particularly valuable skill for people who are hard of hearing.
Deep lipreading is the task of extracting speech from a video of a silent talking face using deep neural networks. It is also known by a few other names: Visual Speech Recognition (VSR), Machine Lipreading, and Automatic Lipreading.
The primary methodology involves two stages: (i) extracting visual and temporal features from the sequence of image frames in a silent talking-face video, and (ii) decoding that feature sequence into units of speech such as characters, words, or phrases. Implementations of this methodology exist both as two separately trained stages and as a single model trained end-to-end.
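The two stages above can be sketched in PyTorch. This is a minimal illustrative sketch, not any specific published architecture: the layer sizes, the 3D-convolutional frontend, and the BiGRU decoder are all assumptions chosen to show the shape of the pipeline.

```python
# Hypothetical two-stage lipreading sketch (all names and sizes illustrative):
# stage 1 extracts spatiotemporal features from lip-region frames,
# stage 2 decodes the feature sequence into per-frame character logits.
import torch
import torch.nn as nn

class VisualFrontend(nn.Module):
    """Stage 1: 3D convolution over a (B, C, T, H, W) stack of frames."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.conv = nn.Conv3d(1, feat_dim, kernel_size=(3, 5, 5),
                              stride=(1, 2, 2), padding=(1, 2, 2))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep the time axis

    def forward(self, x):                          # x: (B, 1, T, H, W)
        h = torch.relu(self.conv(x))
        h = self.pool(h).squeeze(-1).squeeze(-1)   # (B, feat_dim, T)
        return h.transpose(1, 2)                   # (B, T, feat_dim)

class SpeechDecoder(nn.Module):
    """Stage 2: BiGRU over the feature sequence, character logits per frame."""
    def __init__(self, feat_dim=64, hidden=128, vocab_size=28):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab_size)  # incl. a CTC blank

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        h, _ = self.rnn(feats)
        return self.out(h)                         # (B, T, vocab_size)

frontend, decoder = VisualFrontend(), SpeechDecoder()
frames = torch.randn(2, 1, 16, 64, 64)   # 2 clips, 16 frames of 64x64 crops
logits = decoder(frontend(frames))
print(tuple(logits.shape))               # (2, 16, 28)
```

In an end-to-end setup the two modules are trained jointly (e.g. with a CTC or sequence-to-sequence loss on the logits); in a two-stage setup the frontend can be pretrained separately and frozen.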
Benchmark Results
Each header row below begins a separate leaderboard. Results are as claimed in the respective papers; none have been independently verified.
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Conv-seq2seq | Word Error Rate (WER) | 60.1 | — | Unverified |
| 2 | CTC + KD | Word Error Rate (WER) | 59.8 | — | Unverified |
| 3 | TM-seq2seq | Word Error Rate (WER) | 58.9 | — | Unverified |
| 4 | EG-seq2seq | Word Error Rate (WER) | 57.8 | — | Unverified |
| 5 | CTC-V2P | Word Error Rate (WER) | 55.1 | — | Unverified |
| 6 | Hyb + Conformer | Word Error Rate (WER) | 43.3 | — | Unverified |
| 7 | VTP | Word Error Rate (WER) | 40.6 | — | Unverified |
| 8 | ES³ Base | Word Error Rate (WER) | 40.3 | — | Unverified |
| 9 | ES³ Large | Word Error Rate (WER) | 37.1 | — | Unverified |
| 10 | RNN-T | Word Error Rate (WER) | 33.6 | — | Unverified |
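Word Error Rate (WER), used in the tables above and below, is the word-level Levenshtein (edit) distance between the reference transcript and the hypothesis, normalized by the reference length; the character-level variant of the same computation gives the CER reported further down. A minimal implementation:

```python
# WER = edit distance between reference and hypothesis word sequences,
# divided by the number of reference words (reported here as a percentage).
def edit_distance(ref, hyp):
    """Levenshtein distance via a single rolling DP row."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution (or match)
    return d[-1]

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)

# One substitution ("at" -> "in") and one deletion ("now") over 6 words:
print(round(wer("place blue at f two now", "place blue in f two"), 1))  # 33.3
```

Applying `edit_distance` to character lists instead of word lists yields CER.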
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | LIBS | Word Error Rate (WER) | 65.29 | — | Unverified |
| 2 | TM-CTC + extLM | Word Error Rate (WER) | 54.7 | — | Unverified |
| 3 | CTC + KD ASR | Word Error Rate (WER) | 53.2 | — | Unverified |
| 4 | Conv-seq2seq | Word Error Rate (WER) | 51.7 | — | Unverified |
| 5 | Hybrid CTC / Attention | Word Error Rate (WER) | 50 | — | Unverified |
| 6 | LF-MMI TDNN | Word Error Rate (WER) | 48.86 | — | Unverified |
| 7 | TM-seq2seq + extLM | Word Error Rate (WER) | 48.3 | — | Unverified |
| 8 | Multi-head Visual-Audio Memory | Word Error Rate (WER) | 44.5 | — | Unverified |
| 9 | MoCo + wav2vec (w/o extLM) | Word Error Rate (WER) | 43.2 | — | Unverified |
| 10 | CTC/Attention | Word Error Rate (WER) | 32.9 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SyncVSR (Word Boundary) | Top-1 Accuracy | 95 | — | Unverified |
| 2 | 3D Conv + ResNet-18 + DC-TCN + KD (Ensemble & Word Boundary) | Top-1 Accuracy | 94.1 | — | Unverified |
| 3 | SyncVSR | Top-1 Accuracy | 93.2 | — | Unverified |
| 4 | AVCRFormer | Top-1 Accuracy | 89.57 | — | Unverified |
| 5 | 3D Conv + EfficientNetV2 + Transformer + TCN | Top-1 Accuracy | 89.52 | — | Unverified |
| 6 | Vosk + MediaPipe + LS + MixUp + SA + 3DResNet-18 + BiLSTM + Cosine WR | Top-1 Accuracy | 88.7 | — | Unverified |
| 7 | 3D Conv + ResNet-18 + MS-TCN + Multi-Head Visual-Audio Memory | Top-1 Accuracy | 88.5 | — | Unverified |
| 8 | 3D Conv + ResNet-18 + MS-TCN + KD (Ensemble) | Top-1 Accuracy | 88.5 | — | Unverified |
| 9 | 3D-ResNet + Bi-GRU + MixUp + Label Smoothing + Cosine LR (Word Boundary) | Top-1 Accuracy | 88.4 | — | Unverified |
| 10 | 3D-ResNet + Bi-GRU + MixUp + Label Smoothing + Cosine LR | Top-1 Accuracy | 85.5 | — | Unverified |
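The Top-1 Accuracy metric in the word-level tables treats each clip as a classification problem over a fixed word vocabulary: a prediction counts as correct only if the highest-scoring class matches the label. A minimal sketch (the example scores are illustrative):

```python
import numpy as np

# Top-1 accuracy for word-level lipreading: each clip is classified into
# one of N word classes; the prediction is the argmax over class scores.
def top1_accuracy(logits, labels):
    preds = np.asarray(logits).argmax(axis=1)
    return 100.0 * float(np.mean(preds == np.asarray(labels)))

scores = [[0.1, 0.7, 0.2],   # predicted class 1
          [0.8, 0.1, 0.1],   # predicted class 0
          [0.3, 0.3, 0.4]]   # predicted class 2
print(round(top1_accuracy(scores, [1, 0, 1]), 1))  # 66.7
```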
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SyncVSR (Word Boundary) | Top-1 Accuracy | 58.2 | — | Unverified |
| 2 | 3D-ResNet + Bi-GRU + MixUp + Label Smooth + Cosine LR (Word Boundary) | Top-1 Accuracy | 55.7 | — | Unverified |
| 3 | 3D Conv + ResNet-18 + MS-TCN + Multi-Head Visual-Audio Memory | Top-1 Accuracy | 53.8 | — | Unverified |
| 4 | 3D Conv + ResNet-18 + Bi-GRU + Visual-Audio Memory | Top-1 Accuracy | 50.82 | — | Unverified |
| 5 | 3D-ResNet + Bi-GRU + MixUp + Label Smooth + Cosine LR | Top-1 Accuracy | 48.3 | — | Unverified |
| 6 | 3D Conv + ResNet-18 + Bi-GRU (Face Cutout) | Top-1 Accuracy | 45.24 | — | Unverified |
| 7 | DFTN | Top-1 Accuracy | 41.93 | — | Unverified |
| 8 | GLMIM | Top-1 Accuracy | 38.79 | — | Unverified |
| 9 | PCPG | Top-1 Accuracy | 38.7 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | WAS | CER | 38.93 | — | Unverified |
| 2 | LipCH-Net | CER | 34.07 | — | Unverified |
| 3 | CSSMCM | CER | 32.48 | — | Unverified |
| 4 | LIBS | CER | 31.27 | — | Unverified |
| 5 | CTC/Attention | CER | 9.1 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | LipNet | Word Error Rate (WER) | 4.6 | — | Unverified |
| 2 | WAS | Word Error Rate (WER) | 3 | — | Unverified |
| 3 | LCANet | Word Error Rate (WER) | 2.9 | — | Unverified |
| 4 | LipNet (with Face Cutout) | Word Error Rate (WER) | 2.9 | — | Unverified |
| 5 | CTC/Attention | Word Error Rate (WER) | 1.2 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | 3D Conv + ResNet-18 + MS-TCN | Top-1 Accuracy | 41.4 | — | Unverified |
| 2 | 3D Conv + ResNet-34 + Bi-GRU | Top-1 Accuracy | 38.19 | — | Unverified |
| 3 | DenseNet3D + Bi-GRU | Top-1 Accuracy | 34.76 | — | Unverified |
| 4 | Multi-Tower LSTM-5 | Top-1 Accuracy | 25.76 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | ES³ Base* | Word Error Rate (WER) | 55.6 | — | Unverified |