MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets

2022-11-14

Ziyang Ma, Zhisheng Zheng, Changli Tang, Yujin Wang, Xie Chen

Code

github.com/ddlbojack/mt4ssl
Official · In paper · PyTorch · ★ 45

Abstract

In this paper, we provide a new perspective on self-supervised speech models based on how the training targets are obtained. We generalize the targets extractor into Offline Targets Extractor (Off-TE) and Online Targets Extractor (On-TE). Based on this, we propose a new multi-task learning framework for self-supervised learning, MT4SSL, which stands for Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets. MT4SSL uses the K-means algorithm as an Off-TE and a teacher network without gradients as an On-TE. Our model outperforms previous SSL methods by nontrivial margins on the LibriSpeech benchmark, and is comparable to or even better than the best-performing models while using less data. Furthermore, we find that using both Off-TE and On-TE results in better convergence in the pre-training phase. Given both its effectiveness and efficiency, we think multi-task learning on self-supervised speech models from this perspective is a promising direction.
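
As a rough illustration of combining the two kinds of targets, the sketch below pairs a discrete target from K-means centroids (Off-TE) with a continuous target from a gradient-free teacher (On-TE). This is not the authors' implementation. The exponential-moving-average teacher update, the cross-entropy and mean-squared-error loss forms, the weight `alpha`, and all function names are assumptions made for this example only; see the official repository for the actual method.

```python
import math

def nearest_centroid(frame, centroids):
    """Off-TE sketch: return the index of the closest K-means centroid."""
    dists = [sum((f - c) ** 2 for f, c in zip(frame, centroid))
             for centroid in centroids]
    return dists.index(min(dists))

def ema_update(teacher, student, decay=0.999):
    """On-TE sketch: the teacher follows the student without gradients.

    The EMA rule and the decay value are assumptions, not taken from the abstract.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def cross_entropy(logits, target):
    """Classification loss against a discrete (offline) target index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mse(pred, target):
    """Regression loss against a continuous (online) target vector."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def multi_target_loss(logits, offline_target, pred, online_target, alpha=0.5):
    """Weighted sum of both objectives; alpha is a hypothetical weight."""
    return (alpha * cross_entropy(logits, offline_target)
            + (1.0 - alpha) * mse(pred, online_target))
```

In a real pre-training loop, the offline targets would be computed once before training, while the online targets would be recomputed from the teacher at every step.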

Tasks

Automatic Speech Recognition, Multi-Task Learning, Representation Learning, Self-Supervised Learning, Speech Recognition, Speech Representation Learning

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
LibriSpeech test-clean	MT4SSL	Word Error Rate (WER)	3.4	—	Unverified
LibriSpeech test-other	MT4SSL	Word Error Rate (WER)	9.6	—	Unverified
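
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below computes it as a percentage, on the assumption that the table values are percentages. The function name and the whitespace tokenization are illustrative choices; the scoring script behind the claimed numbers is not specified on this page.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent between two whitespace-tokenized transcripts."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Benchmark figures are normally computed over the whole test set by summing errors and reference words across utterances, not by averaging per-utterance rates.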
