CR-CTC: Consistency regularization on CTC for improved speech recognition

2024-10-07

Zengwei Yao, Wei Kang, Xiaoyu Yang, Fangjun Kuang, Liyong Guo, Han Zhu, Zengrui Jin, Zhaoqing Li, Long Lin, Daniel Povey

Code

github.com/k2-fsa/icefall
Official, in paper, PyTorch, ★ 1,379

Abstract

Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR), renowned for its simplicity and computational efficiency. However, it often falls short in recognition performance. In this work, we propose the Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram. We provide in-depth insights into its essential behaviors from three perspectives: 1) it conducts self-distillation between random pairs of sub-models that process different augmented views; 2) it learns contextual representation through masked prediction for positions within time-masked regions, especially when we increase the amount of time masking; 3) it suppresses the extremely peaky CTC distributions, thereby reducing overfitting and improving the generalization ability. Extensive experiments on LibriSpeech, Aishell-1, and GigaSpeech datasets demonstrate the effectiveness of our CR-CTC. It significantly improves the CTC performance, achieving state-of-the-art results comparable to those attained by transducer or systems combining CTC and attention-based encoder-decoder (CTC/AED). We release our code at https://github.com/k2-fsa/icefall.
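The consistency term described in the abstract can be illustrated with a small sketch. The abstract does not state the exact loss, so the code below assumes a symmetric, frame-wise KL divergence between the two CTC posteriors; the function names (`softmax`, `kl_divergence`, `consistency_loss`) and the toy logits are hypothetical and are not taken from the icefall code. In training, this term would typically be added with some weight to the CTC losses of both views, and the target side of each KL term would usually be detached from the gradient; neither step is shown here because this sketch only evaluates the value.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
    )


def consistency_loss(logits_a, logits_b):
    """Assumed form of the consistency term: symmetric KL summed over frames.

    logits_a and logits_b are [T][V] nested lists holding the frame-wise
    CTC logits produced from two augmented views of the same utterance.
    """
    if len(logits_a) != len(logits_b):
        raise ValueError("both views must have the same number of frames")
    total = 0.0
    for frame_a, frame_b in zip(logits_a, logits_b):
        p_a = softmax(frame_a)
        p_b = softmax(frame_b)
        # Each view serves as the target for the other one.
        total += 0.5 * (kl_divergence(p_b, p_a) + kl_divergence(p_a, p_b))
    return total


if __name__ == "__main__":
    # Toy example: 2 frames, vocabulary of 3 symbols (index 0 as blank).
    view_a = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
    view_b = [[1.5, 0.7, -0.5], [0.0, 0.4, 2.5]]
    print("consistency loss:", consistency_loss(view_a, view_b))
```

The loss is zero when both views produce identical distributions and grows as they diverge, which is the behavior a consistency regularizer needs.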

Tasks

Automatic Speech Recognition, Automatic Speech Recognition (ASR), Computational Efficiency, Decoder, speech-recognition, Speech Recognition

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
AISHELL-1	Zipformer+CR-CTC (no external language model)	Word Error Rate (WER)	4.02	—	Unverified
GigaSpeech DEV	Zipformer+pruned transducer (no external language model)	Word Error Rate (WER)	10.09	—	Unverified
GigaSpeech DEV	Zipformer+CR-CTC (no external language model)	Word Error Rate (WER)	10.15	—	Unverified
GigaSpeech DEV	Zipformer+pruned transducer w/ CR-CTC (no external language model)	Word Error Rate (WER)	9.95	—	Unverified
GigaSpeech TEST	Zipformer+CR-CTC (no external language model)	Word Error Rate (WER)	10.28	—	Unverified
GigaSpeech TEST	Zipformer+pruned transducer w/ CR-CTC (no external language model)	Word Error Rate (WER)	10.03	—	Unverified
GigaSpeech TEST	Zipformer+CR-CTC/AED (no external language model)	Word Error Rate (WER)	10.07	—	Unverified
GigaSpeech TEST	Zipformer+pruned transducer (no external language model)	Word Error Rate (WER)	10.2	—	Unverified
LibriSpeech test-clean	Zipformer+pruned transducer w/ CR-CTC (no external language model)	Word Error Rate (WER)	1.88	—	Unverified
LibriSpeech test-clean	Zipformer+CR-CTC (no external language model)	Word Error Rate (WER)	2.02	—	Unverified
LibriSpeech test-other	Zipformer+CR-CTC (no external language model)	Word Error Rate (WER)	4.35	—	Unverified
LibriSpeech test-other	Zipformer+pruned transducer w/ CR-CTC (no external language model)	Word Error Rate (WER)	3.95	—	Unverified
