Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

2022-06-02Code Available2· sign in to hype

Sehoon Kim, Amir Gholami, Albert Shaw, Nicholas Lee, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer

arXiv PDF

Code Available — Be the first to reproduce this paper.

Reproduce

Code

github.com/kssteven418/squeezeformer
OfficialIn papertf★ 264
github.com/upskyy/Squeezeformer
pytorch★ 148
github.com/msalhab96/SpeeQ
pytorch★ 51
github.com/NVIDIA/NeMo/tree/main/examples/asr/conf/squeezeformer
pytorch★ 0

Abstract

The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro and micro-architecture of Conformer, we propose Squeezeformer which consistently outperforms the state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of multi-head attention or convolution modules followed up by feed-forward module instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depthwise down-sampling layer to efficiently sub-sample the input signal. Squeezeformer achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate (WER) on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online.

Tasks

Automatic Speech Recognition Automatic Speech Recognition (ASR)Speech Recognition

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
LibriSpeech test-clean	Squeezeformer (L)	Word Error Rate (WER)	2.47	—	Unverified
LibriSpeech test-other	Squeezeformer (L)	Word Error Rate (WER)	5.97	—	Unverified

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Code

Abstract

Tasks

Benchmark Results

Reproductions