DiffWave: A Versatile Diffusion Model for Audio Synthesis
Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, Bryan Catanzaro
Code
- github.com/lmnt-com/diffwave (PyTorch, ★ 889)
- github.com/neillu23/cdiffuse (PyTorch, ★ 250)
- github.com/keonlee9420/DiffSinger (PyTorch, ★ 247)
- github.com/albertfgu/diffwave-sashimi (PyTorch, ★ 128)
- github.com/philsyn/diffwave-vocoder (PyTorch, ★ 90)
- github.com/philsyn/diffwave-unconditional (PyTorch, ★ 43)
- github.com/revsic/tf-diffwave (TensorFlow, ★ 42)
- github.com/revsic/jax-variational-diffwave (JAX, ★ 40)
- github.com/neillu23/DiffuSE (PyTorch, ★ 36)
- github.com/revsic/torch-diffusion-wavegan (PyTorch, ★ 17)
Abstract
In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts white noise into a structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of the variational bound on the data likelihood. DiffWave produces high-fidelity audio across waveform generation tasks, including neural vocoding conditioned on mel spectrograms, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models on the challenging unconditional generation task in terms of audio quality and sample diversity, under various automatic and human evaluations.
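The reverse Markov chain the abstract describes (white noise refined into a waveform over a fixed number of steps) follows the standard DDPM sampler. Below is a minimal NumPy sketch under assumed parameters: the step count `T`, the linear noise schedule, and the placeholder `eps_theta` network are all illustrative stand-ins, not DiffWave's actual configuration (the real model conditions a dilated-convolution network on the mel spectrogram and the diffusion step).

```python
import numpy as np

T = 50                                    # assumed number of diffusion steps (constant at synthesis)
betas = np.linspace(1e-4, 0.05, T)        # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # cumulative products \bar{alpha}_t

def eps_theta(x, t):
    """Hypothetical stand-in for the trained noise-prediction network."""
    return np.zeros_like(x)

def sample(length, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(length)       # start from white noise x_T
    for t in reversed(range(T)):
        z = rng.standard_normal(length) if t > 0 else 0.0
        # Posterior mean: remove the predicted noise, then rescale.
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps_theta(x, t)) / np.sqrt(alphas[t])
        x = x + np.sqrt(betas[t]) * z     # sigma_t^2 = beta_t variance choice
    return x

audio = sample(16000)                     # e.g. one second of audio at 16 kHz
```

Each iteration applies one reverse transition of the Markov chain, so synthesis cost is a fixed `T` network evaluations regardless of waveform length, in contrast to autoregressive models that need one evaluation per sample.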
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| LJSpeech | DiffWave LARGE | Mean Opinion Score | 4.44 | — | Unverified |