Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous
Code
- github.com/PaddlePaddle/PaddleSpeech (PaddlePaddle) ★ 12,564
- github.com/keonlee9420/Cross-Speaker-Emotion-Transfer (PyTorch) ★ 194
- github.com/hash2430/pitchtron (PyTorch) ★ 157
- github.com/foamliu/GST-Tacotron-v2 (PyTorch) ★ 0
- github.com/cnlinxi/style-token_tacotron2 (TensorFlow) ★ 0
- github.com/Kyubyong/expressive_tacotron (TensorFlow) ★ 0
- github.com/KinglittleQ/GST-Tacotron (PyTorch) ★ 0
- github.com/CODEJIN/GST_Tacotron (TensorFlow) ★ 0
- github.com/jinhan/tacotron2-gst (PyTorch) ★ 0
- github.com/acetylSv/GST-tacotron (TensorFlow) ★ 0
Abstract
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style, independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
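The core mechanism the abstract describes, a learned bank of style tokens combined by attention into a single style embedding, can be illustrated with a minimal PyTorch sketch. All names (`GlobalStyleTokens`, `query_proj`), the tensor sizes, and the single-head scaled dot-product attention are illustrative assumptions, not the authors' code; the paper itself uses multi-head attention over the token bank, conditioned on a reference-encoder summary of an audio clip.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalStyleTokens(nn.Module):
    """Sketch of a GST layer: a learned bank of style-token embeddings
    is attended over to produce one style embedding that conditions the
    synthesizer. Hypothetical sizes; single-head attention for brevity
    (the paper uses multi-head attention)."""

    def __init__(self, num_tokens=10, token_dim=256, ref_dim=128):
        super().__init__()
        # Token bank, trained jointly with Tacotron and with no
        # explicit style labels.
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        # Projects the reference-encoder summary into the token space.
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):
        # ref_embedding: (batch, ref_dim) summary of a reference clip.
        query = self.query_proj(ref_embedding)             # (B, D)
        keys = torch.tanh(self.tokens)                     # (T, D)
        scores = query @ keys.t() / keys.shape[-1] ** 0.5  # (B, T)
        weights = F.softmax(scores, dim=-1)  # soft, interpretable "labels"
        style_embedding = weights @ keys     # (B, D), conditions synthesis
        return style_embedding, weights

# At inference, hand-picked token weights can replace the reference
# clip entirely, steering speaking style independently of the text.
gst = GlobalStyleTokens()
manual_weights = F.one_hot(torch.tensor([3]), num_classes=10).float()
style = manual_weights @ torch.tanh(gst.tokens)
```

The final two lines mirror the control use case from the abstract: because the attention weights act as soft labels, selecting or interpolating them directly, rather than deriving them from a reference clip, gives text-independent control over the synthesized style.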