DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

2023-05-17 · NeurIPS 2023 · Code Available

Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R. Glass

Code

github.com/alexander-h-liu/dinosr
Official · In paper · PyTorch · ★ 54

Abstract

In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units.
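
The three steps named in the abstract (teacher embeddings, online clustering, discrete targets for the student) can be illustrated with a minimal sketch. This is not the official implementation; the repository listed under Code is the reference. The sketch assumes a single codebook whose codewords are running per-cluster sums divided by running counts, a teacher whose parameters are an exponential moving average of the student's, and a cross-entropy loss computed only on masked frames. All function names, decay values, and the toy data are hypothetical.

```python
import math

def nearest_code(frame, codebook):
    """Index of the codeword closest (squared Euclidean distance) to one frame."""
    dists = [sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook]
    return dists.index(min(dists))

def ema_update(old, new, decay):
    """Exponential moving average of two equal-length vectors."""
    return [decay * o + (1.0 - decay) * n for o, n in zip(old, new)]

def update_codebook(frames, sums, counts, decay=0.9):
    """One online clustering step over teacher embeddings (assumed update rule).

    sums[v] and counts[v] are running statistics for codeword v, and the
    codeword itself is sums[v] / counts[v]. Both lists are updated in place.
    Returns the cluster index assigned to each frame.
    """
    codebook = [[s / counts[v] for s in sums[v]] for v in range(len(sums))]
    ids = [nearest_code(f, codebook) for f in frames]
    dim = len(frames[0])
    for v in range(len(sums)):
        members = [f for f, i in zip(frames, ids) if i == v]
        if not members:
            continue  # leave unused codewords untouched this step
        batch_sum = [sum(m[d] for m in members) for d in range(dim)]
        sums[v] = ema_update(sums[v], batch_sum, decay)
        counts[v] = decay * counts[v] + (1.0 - decay) * len(members)
    return ids

def student_loss(logits, ids, masked):
    """Mean cross-entropy between student logits and cluster ids on masked frames."""
    total = 0.0
    for t in masked:
        m = max(logits[t])
        log_z = m + math.log(sum(math.exp(x - m) for x in logits[t]))
        total += log_z - logits[t][ids[t]]
    return total / len(masked)

def update_teacher(teacher, student, decay=0.999):
    """Self-distillation step: the teacher tracks the student by EMA."""
    return ema_update(teacher, student, decay)

# Toy example: four 2-dimensional "teacher embeddings" and two codewords.
frames = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0]]
sums = [[0.0, 0.0], [5.0, 5.0]]
counts = [1.0, 1.0]
ids = update_codebook(frames, sums, counts)  # ids == [0, 0, 1, 1]

# Uniform student logits over the two codewords; frames 1 and 2 are masked.
logits = [[0.0, 0.0] for _ in frames]
loss = student_loss(logits, ids, masked=[1, 2])  # equals log(2) for uniform logits
```

In this sketch no gradient flows into the teacher or the codebook: both change only through moving averages, and only the student would be trained on the loss.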

Tasks

Clustering, Language Modeling, Language Modelling, Masked Language Modeling, Online Clustering, Representation Learning, Speech Representation Learning
