
Understanding the Role of Self Attention for Efficient Speech Recognition

2021-09-29 · ICLR 2022

Kyuhong Shim, Jungwook Choi, Wonyong Sung


Abstract

Self-attention (SA) is a critical component of the Transformer neural networks that have succeeded in automatic speech recognition (ASR). However, its computational cost grows quadratically with the sequence length, which is especially problematic for the long input sequences in ASR. In this paper, we analyze the role of SA in Transformer-based ASR models to improve their efficiency. We reveal that SA performs two distinct roles: phonetic localization and linguistic localization. Using a novel metric called phoneme attention relationship (PAR), we show that phonetic localization in the lower layers extracts phonologically meaningful features from speech and standardizes the phonetic variance in the utterance, enabling proper linguistic localization in the upper layers. From this understanding, we find that attention maps can be reused across layers as long as their localization capability is preserved. To evaluate this idea, we implement layer-wise attention map reuse on real GPU platforms and achieve up to a 1.96× inference speedup and 33% savings in training time, with noticeably improved ASR performance on the challenging LibriSpeech dev-other/test-other benchmarks.
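The reuse idea in the abstract can be illustrated with a minimal sketch: a layer either computes its own attention map (the quadratic QK^T step) or reuses the map produced by the previous layer, skipping that step. This is an assumption-laden toy in NumPy, not the authors' implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv, attn=None):
    """One toy single-head SA layer (hypothetical, for illustration).

    If a precomputed attention map `attn` is given, the quadratic
    QK^T score computation is skipped and the map is reused.
    """
    if attn is None:
        q, k = x @ wq, x @ wk
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ (x @ wv), attn

rng = np.random.default_rng(0)
T, d = 6, 4                      # toy sequence length and model dim
x = rng.standard_normal((T, d))
layers = [tuple(rng.standard_normal((d, d)) for _ in range(3))
          for _ in range(4)]

# Layer-wise reuse: odd-indexed layers reuse the map from the layer
# below, halving the number of quadratic score computations.
h, attn = x, None
for i, (wq, wk, wv) in enumerate(layers):
    reuse = attn if i % 2 == 1 else None
    h, attn = self_attention(h, wq, wk, wv, attn=reuse)
```

In a real model the reuse pattern (which layers share a map) would be chosen so that the phonetic and linguistic localization behavior described above is preserved; this sketch only shows the mechanical skip of the score computation.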
