Updated Corpora and Benchmarks for Long-Form Speech Recognition

2023-09-26

Jennifer Drexler Fox, Desh Raj, Natalie Delworth, Quinn McNamara, Corey Miller, Migüel Jetté

Code

github.com/revdotcom/speech-datasets

Abstract

The vast majority of ASR research uses corpora in which both the training and test data have been pre-segmented into utterances. In most real-world ASR use-cases, however, test audio is not segmented, leading to a mismatch between inference-time conditions and models trained on segmented utterances. In this paper, we re-release three standard ASR corpora - TED-LIUM 3, GigaSpeech, and VoxPopuli-en - with updated transcription and alignments to enable their use for long-form ASR research. We use these reconstituted corpora to study the train-test mismatch problem for transducers and attention-based encoder-decoders (AEDs), confirming that AEDs are more susceptible to this issue. Finally, we benchmark a simple long-form training for these models, showing its efficacy for model robustness under this domain shift.

Tasks

Speech Recognition
