wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

2020-06-20 · NeurIPS 2020 · Code Available

Alexei Baevski, Henry Zhou, Abdel-rahman Mohamed, Michael Auli


Code

github.com/sh-lee-prml/hierspeechpp (pytorch, ★ 1,242)
github.com/facebookresearch/brainmagick (pytorch, ★ 462)
github.com/mailong25/vietnamese-speech-recognition (pytorch, ★ 379)
github.com/mailong25/self-supervised-speech-recognition (pytorch, ★ 379)
github.com/huseinzol05/malaya-speech (tf, ★ 283)
github.com/neonbjb/ocotillo (pytorch, ★ 254)
github.com/shivangi-aneja/FaceTalk (pytorch, ★ 238)
github.com/eastonYi/wav2vec (pytorch, ★ 170)
github.com/vasudevgupta7/gsoc-wav2vec2 (tf, ★ 91)
github.com/JoungheeKim/Non-Attentive-Tacotron (pytorch, ★ 57)

Abstract

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
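The contrastive task mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common InfoNCE-style form: for one masked time step, the context vector should be more similar to the true quantized latent than to a set of distractors. The function names, the use of cosine similarity, and the default temperature of 0.1 are illustrative assumptions, not details stated on this page, and the sketch leaves out the quantizer, the masking procedure, and any auxiliary losses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """InfoNCE-style loss for a single masked time step (hypothetical helper).

    context:     context-network output at the masked position
    target:      the true quantized latent for that position
    distractors: quantized latents sampled from other positions
    temperature: assumed scaling constant for the similarities
    """
    candidates = [target] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # Negative log-probability of picking the true target (index 0).
    return log_denominator - logits[0]


if __name__ == "__main__":
    aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
    misaligned = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
    print(f"aligned target:    {aligned:.5f}")
    print(f"misaligned target: {misaligned:.5f}")
```

The loss is close to zero when the context vector points at the true target and grows when a distractor is the better match, which is the signal that would drive the representations during pre-training.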

Tasks

Quantization, Self-Supervised Learning, Speech Recognition, Zero-Shot Audio Retrieval

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
Libri-Light test-clean	wav2vec 2.0 Large-10h-LV-60k	Word Error Rate (WER)	2.5	—	Unverified
Libri-Light test-other	wav2vec 2.0 Large-10h-LV-60k	Word Error Rate (WER)	5	—	Unverified
LibriSpeech test-clean	wav2vec 2.0 with Libri-Light	Word Error Rate (WER)	1.8	—	Unverified
LibriSpeech test-other	wav2vec 2.0 with Libri-Light	Word Error Rate (WER)	3	—	Unverified
LibriSpeech test-other	wav2vec 2.0	Word Error Rate (WER)	4.1	—	Unverified
TIMIT	wav2vec 2.0	Percentage error	8.3	—	Unverified
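
For readers unfamiliar with the metric, the Word Error Rate figures in the table can be understood through the following minimal sketch. It assumes the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words and expressed as a percentage. The function name and the whitespace tokenization are illustrative assumptions; published evaluations may apply text normalization that is not shown here.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent between two whitespace-tokenized transcripts."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

In this example one of six reference words is deleted, so the result is about 16.7. The TIMIT row reports a percentage error rather than WER, so this sketch applies only to the WER rows.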
