Multi-Head State Space Model for Speech Recognition

2023-05-21

Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer, Mark J. F. Gales

Abstract

State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, where parallel heads are taught to learn local and global temporal dynamics on sequence data. As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus. Furthermore, we augment the transformer block with MH-SSM layers, referred to as the Stateformer, achieving state-of-the-art performance on the LibriSpeech task, with word error rates of 1.76%/4.37% on the development sets and 1.91%/4.36% on the test sets without using an external language model.
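To make the idea of parallel heads with local and global temporal dynamics concrete, here is a minimal sketch of a generic discrete linear state space recurrence run over several heads. It is not the paper's MH-SSM: the scalar state per head, the function names `ssm_head` and `multi_head_ssm`, and the sigmoid input gate are all assumptions chosen for illustration, since the abstract does not specify the gating mechanism or the parameterisation.

```python
import math


def ssm_head(u, a, b, c):
    """Scalar-state linear SSM (illustrative): x_k = a*x_{k-1} + b*u_k, y_k = c*x_k."""
    x = 0.0
    ys = []
    for u_k in u:
        x = a * x + b * u_k
        ys.append(c * x)
    return ys


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multi_head_ssm(u, heads, gate_weight=1.0):
    """Run several SSM heads in parallel over the same input sequence.

    heads: list of (a, b, c) tuples, one per head.
    The sigmoid gate on the input is an ASSUMPTION for illustration only;
    it is not the gating mechanism described in the paper.
    Returns one list of per-head outputs for every time step.
    """
    per_head = [ssm_head(u, a, b, c) for (a, b, c) in heads]
    out = []
    for k, u_k in enumerate(u):
        g = sigmoid(gate_weight * u_k)
        out.append([g * ys[k] for ys in per_head])
    return out


# A small decay factor forgets quickly (local context);
# a decay factor near 1 retains information longer (global context).
heads = [(0.1, 1.0, 1.0), (0.9, 1.0, 1.0)]
impulse = [1.0, 0.0, 0.0, 0.0]
outputs = multi_head_ssm(impulse, heads)
```

With an impulse input, the first head's response decays by a factor of 10 per step while the second decays by only 10%, which is one simple way different heads can cover different time scales.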

Tasks

Language Modelling, Speech Recognition, State Space Models

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
LibriSpeech test-clean	Stateformer	Word Error Rate (WER)	1.91	—	Unverified
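For readers attempting a reproduction, the metric in the table can be computed as the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a plain-Python illustration; the function name `word_error_rate` and the whitespace tokenisation are assumptions, and the text normalisation used for the reported numbers is not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Corpus-level WER is normally obtained by summing the edit distances and reference lengths over all utterances before dividing, not by averaging per-utterance percentages.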
