Combining Residual Networks with LSTMs for Lipreading
2017-03-12
Themos Stafylakis, Georgios Tzimiropoulos
Code
- github.com/tstafylakis/Lipreading-ResNet (official implementation, referenced in the paper; PyTorch)
- github.com/michaeltrs/Lipreading_ResNet_LSTM (TensorFlow)
- github.com/manideep2510/Lipreading-Keras (TensorFlow)
- github.com/gaalszandi/visual_speech_recognition (framework not specified)
Abstract
We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database of 500 target words whose samples are 1.28-second video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the current state of the art, without using information about word boundaries during training or testing.
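The pipeline described in the abstract can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' code: the kernel sizes, hidden width, the lightweight 2D trunk standing in for the full ResNet-34, and the 29-frame input length (roughly 1.28 s at 25 fps) are all assumptions chosen to keep the sketch short and runnable.

```python
import torch
import torch.nn as nn


class LipreadingSketch(nn.Module):
    """Sketch of 3D conv front-end -> per-frame 2D trunk -> Bi-LSTM -> 500-way classifier."""

    def __init__(self, num_classes=500, hidden=256):
        super().__init__()
        # Spatiotemporal front-end: a single 3D convolution over (time, height, width).
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Per-frame 2D trunk (a stand-in for ResNet-34), pooled to one vector per frame.
        self.trunk = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        # Bidirectional LSTM over the frame sequence, then a linear word classifier.
        self.blstm = nn.LSTM(128, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):
        # x: (B, 1, T, H, W) grayscale mouth-region crops
        f = self.frontend(x)                      # (B, 64, T, H', W')
        b, c, t, h, w = f.shape
        f = f.transpose(1, 2).reshape(b * t, c, h, w)
        f = self.trunk(f).reshape(b, t, -1)       # (B, T, 128) frame features
        out, _ = self.blstm(f)                    # (B, T, 2*hidden)
        return self.fc(out.mean(dim=1))           # average over time -> (B, num_classes)


model = LipreadingSketch()
logits = model(torch.randn(2, 1, 29, 112, 112))  # 29 frames of 112x112 crops
```

Note that no word-boundary information enters the model: the whole 1.28-second clip is consumed and the temporal dimension is aggregated only at the very end, consistent with the abstract's claim.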
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| Lip Reading in the Wild | 3D Conv + ResNet-34 + Bi-LSTM | Top-1 Accuracy | 83.0% | — | Unverified |