ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

2020-05-07

Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu

Code

github.com/openspeech-team/openspeech
pytorch★ 717
github.com/upskyy/ContextNet
pytorch★ 38
github.com/hasangchun/ContextNet
pytorch★ 38
github.com/Cross-Caps/STFADE
tf★ 0

Abstract

Convolutional neural networks (CNNs) have shown promising results for end-to-end speech recognition, but they still trail other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond it with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet and achieves a good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without an external language model (LM), 1.9%/4.1% with an LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system, which achieved 2.0%/4.6% with an LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.
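
To make the squeeze-and-excitation idea from the abstract concrete, here is a minimal pure-Python sketch of one such module applied to a 1D feature sequence. It is an illustration under stated assumptions, not the authors' implementation: the function name `squeeze_excite`, the weight layout, and the ReLU in the bottleneck are all assumptions made for the example. The point is only to show how a per-channel average over the whole utterance (the global context) becomes a gate that rescales every frame.

```python
import math


def squeeze_excite(x, w1, b1, w2, b2):
    """Apply a squeeze-and-excitation gate to a 1D feature sequence.

    x  : list of C channels, each a list of T frame values
    w1 : bottleneck weights, shape (H, C); b1 : biases, length H
    w2 : expansion weights, shape (C, H);  b2 : biases, length C
    All names and shapes are hypothetical, chosen for this sketch.
    """
    # Squeeze: average each channel over all time frames (global context).
    pooled = [sum(channel) / len(channel) for channel in x]

    # Excitation, step 1: bottleneck layer with ReLU (an assumed activation).
    hidden = [
        max(0.0, sum(w * p for w, p in zip(row, pooled)) + b)
        for row, b in zip(w1, b1)
    ]

    # Excitation, step 2: expand back to C channels and squash to (0, 1).
    gates = [
        1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
        for row, b in zip(w2, b2)
    ]

    # Scale: multiply every frame of a channel by that channel's gate.
    return [[g * value for value in channel] for g, channel in zip(gates, x)]
```

Because each gate is computed from the time-averaged features, every output frame depends on the entire sequence, whereas a convolution with a finite kernel only sees a local window.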

Tasks

Automatic Speech Recognition (ASR)
Language Modeling
Speech Recognition

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
LibriSpeech test-clean	ContextNet(L)	Word Error Rate (WER)	1.9	—	Unverified
LibriSpeech test-clean	ContextNet(M)	Word Error Rate (WER)	2.0	—	Unverified
LibriSpeech test-clean	ContextNet(S)	Word Error Rate (WER)	2.3	—	Unverified
LibriSpeech test-other	ContextNet(L)	Word Error Rate (WER)	4.1	—	Unverified
LibriSpeech test-other	ContextNet(M)	Word Error Rate (WER)	4.5	—	Unverified
LibriSpeech test-other	ContextNet(S)	Word Error Rate (WER)	5.5	—	Unverified
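
For readers unfamiliar with the metric in the table, the following sketch shows the standard definition of word error rate: the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and its reference, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration. Benchmark figures such as those above are typically aggregated over a whole test set and reported in percent, which this single-utterance sketch does not do.

```python
def word_error_rate(reference, hypothesis):
    """Return the word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sat on mat" against the reference "the cat sat on the mat" gives one deletion over six reference words.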
