SOTAVerified

WaveNet: A Generative Model for Raw Audio

2016-09-12 · Code Available

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu


Abstract

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
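The abstract's key idea is that the waveform's joint distribution factorizes autoregressively, p(x) = ∏_t p(x_t | x_1, …, x_{t-1}), with each conditional computed by a stack of dilated causal convolutions whose receptive field grows exponentially with depth. A minimal NumPy sketch of the causal dilation pattern follows; the function name, filter sizes, and dilation schedule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution: the output at time t depends only on
    x[t], x[t - dilation], ... — never on future samples."""
    pad = dilation * (len(w) - 1)
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so no future leakage
    return np.array([
        sum(w[k] * xp[t + pad - k * dilation] for k in range(len(w)))
        for t in range(len(x))
    ])

# Doubling dilations (1, 2, 4, 8, ...) grow the receptive field
# exponentially with depth, which is how the model can condition on
# thousands of past audio samples at modest cost (toy sizes here).
x = np.random.randn(16)
h = x
for d in [1, 2, 4, 8]:
    h = np.tanh(causal_dilated_conv(h, np.array([0.5, 0.5]), d))
```

Causality can be checked directly: perturbing a sample at time t changes the output only at times ≥ t, which is what makes teacher-forced training over all timesteps in parallel possible despite sample-by-sample generation.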

Benchmark Results

Dataset                | Model                    | Metric             | Claimed | Verified | Status
Mandarin Chinese       | WaveNet (L+F)            | Mean Opinion Score | 4.08    |          | Unverified
Mandarin Chinese       | LSTM-RNN parametric      | Mean Opinion Score | 3.79    |          | Unverified
Mandarin Chinese       | HMM-driven concatenative | Mean Opinion Score | 3.47    |          | Unverified
North American English | WaveNet (L+F)            | Mean Opinion Score | 4.21    |          | Unverified
North American English | HMM-driven concatenative | Mean Opinion Score | 3.86    |          | Unverified
North American English | LSTM-RNN parametric      | Mean Opinion Score | 3.67    |          | Unverified

Reproductions