LEARNING PHONEME-LEVEL DISCRETE SPEECH REPRESENTATION WITH WORD-LEVEL SUPERVISION

2021-09-29

Liming Wang, Siyuan Feng, Mark A. Hasegawa-Johnson, Chang D. Yoo

Abstract

Phonemes are defined by their relationship to words: changing a phoneme changes the word. Learning a phoneme inventory with little supervision has been a long-standing challenge with important applications to under-resourced speech technology. In this paper, we bridge the gap between the linguistic and statistical definitions of phonemes and propose a novel neural discrete representation learning model for self-supervised learning of a phoneme inventory from raw speech and word labels. Under mild assumptions, we prove that the phoneme inventory learned by our approach converges to the true one with an exponentially low error rate. Moreover, in experiments on the TIMIT and Mboshi benchmarks, our approach consistently learns better phoneme-level representations than previous state-of-the-art self-supervised representation learning algorithms and remains effective even in a low-resource scenario.

Tasks

Representation Learning, Self-Supervised Learning
