Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard

2020-01-20

Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury

Abstract

It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training. In this paper, we show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database using a single-headed attention, LSTM-based model. Using a cross-utterance language model, our single-pass, speaker-independent system reaches 6.4% and 12.5% word error rate (WER) on the Switchboard and CallHome subsets of Hub5'00, without a pronunciation lexicon. While careful regularization and data augmentation are crucial in achieving this level of performance, experiments on Switchboard-2000 show that nothing is more useful than more data. Overall, the combination of various regularizations and a simple but fairly large model results in a new state of the art, 4.7% and 7.8% WER on the Switchboard and CallHome sets, using SWB-2000 without any external data resources.
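
For readers unfamiliar with the metric quoted above, WER is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the scoring pipeline used in the paper; the function name `word_error_rate` and the example sentences are hypothetical, and text normalization rules used in official Hub5'00 scoring are not modeled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Assumption: both strings are already normalized and whitespace-tokenized.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one substitution and one deletion over six reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {100 * wer:.1f}%")
```

Under this definition, a reported WER of 6.4% corresponds to roughly 6.4 word errors per 100 reference words, and values above 100% are possible when the hypothesis contains many insertions.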

Tasks

Data Augmentation, Language Modeling, Speech Recognition

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
swb_hub_500 WER fullSWBCH	IBM (LSTM encoder-decoder)	Percentage error	7.8	—	Unverified
Switchboard + Hub500	IBM (LSTM encoder-decoder)	Percentage error	4.7	—	Unverified
