LipNet: End-to-End Sentence-level Lipreading
Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, Nando de Freitas
Code Implementations
- github.com/rizkiarm/LipNet (official, TensorFlow, ★ 0)
- github.com/sailordiary/LipNet-PyTorch (PyTorch, ★ 69)
- github.com/speech-separation-hse/video-features (PyTorch, ★ 0)
- github.com/ski-net/lipnet (MXNet, ★ 0)
- github.com/Abishalini/LipReadingGUI (framework unspecified, ★ 0)
- github.com/Fengdalu/LipNet-PyTorch (PyTorch, ★ 0)
- github.com/hero9968/lipnet-python (TensorFlow, ★ 0)
- github.com/ms8909/LipONet (TensorFlow, ★ 0)
- github.com/LiZhenghua0311/lip (TensorFlow, ★ 0)
- github.com/SohaibAnwaar/lip-Reading-by-Deep-learning (TensorFlow, ★ 0)
Abstract
Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). However, existing work on models trained end-to-end performs only word classification, rather than sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first end-to-end sentence-level lipreading model that simultaneously learns spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 95.2% accuracy on the sentence-level, overlapped-speaker split task, outperforming experienced human lipreaders and the previous 86.4% word-level state-of-the-art accuracy (Gergen et al., 2016).
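The pipeline the abstract describes (spatiotemporal convolutions feeding a recurrent network, with per-frame character logits for the CTC loss) can be sketched as follows. This is a minimal illustration, not the authors' released model: the layer count, kernel sizes, channel widths, GRU size, and the 28-symbol vocabulary are all assumed for the sketch (the paper uses a deeper STCNN stack and Bi-GRUs).

```python
# Hedged sketch of a LipNet-style model. 3D convolutions keep stride 1 along the
# time axis so every input frame yields a feature vector; a bidirectional GRU
# then produces per-frame character logits suitable for CTC training.
import torch
import torch.nn as nn

class LipNetSketch(nn.Module):
    def __init__(self, vocab_size=28):  # 26 letters + space + CTC blank (assumed vocabulary)
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 1, 1), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # 6 x 12 is the spatial map left over from assumed 50 x 100 mouth crops
        self.gru = nn.GRU(input_size=64 * 6 * 12, hidden_size=128,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 128, vocab_size)

    def forward(self, x):            # x: (batch, channels, time, height, width)
        f = self.conv(x)             # (batch, 64, time, 6, 12); time axis preserved
        b, c, t, h, w = f.shape
        f = f.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.gru(f)
        return self.fc(out)          # (batch, time, vocab): per-frame logits for CTC

model = LipNetSketch()
frames = torch.randn(1, 3, 75, 50, 100)  # 75 frames of 50x100 RGB mouth crops (GRID-like)
logits = model(frames)
print(logits.shape)                      # torch.Size([1, 75, 28])
```

Because the temporal resolution is preserved end to end, the logits can be passed (after `log_softmax`) to `nn.CTCLoss`, which is what lets a 75-frame video be trained against a character sequence of any length without frame-level alignment.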
Tasks
- Lipreading
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| GRID corpus (mixed-speech) | LipNet | Word Error Rate (WER, %) | 4.6 | — | Unverified |
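The table's metric, Word Error Rate, is the word-level edit distance (substitutions + insertions + deletions) between the model's transcript and the reference, divided by the number of reference words. A minimal, self-contained implementation for illustration; `wer` is a hypothetical helper, not the evaluation script behind the table:

```python
# Word Error Rate via word-level Levenshtein distance (dynamic programming).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# GRID-style sentences: one substituted word out of six -> WER = 1/6
print(wer("bin blue at f two now", "bin blue at f two now"))  # 0.0
print(wer("bin blue at f two now", "bin blue by f two now"))  # 0.1666...
```

A claimed 4.6 WER in the table therefore means roughly 4.6 words wrong per 100 reference words.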