Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, Mohammad Norouzi
Code
- github.com/NoaCahan/WavenetAutoEncoder (PyTorch, ★ 12)
- github.com/spear011/scm-dataset (TensorFlow, ★ 1)
- github.com/morris-frank/nsynth-pytorch (PyTorch, ★ 0)
- github.com/facebookresearch/SING (PyTorch, ★ 0)
- github.com/MindSpore-scientific/code-6/tree/main/neural-audio-synthesis-wavenet (MindSpore, ★ 0)
- github.com/MindCode-4/code-8/tree/main/neural-audio-synthesis-wavenet (MindSpore, ★ 0)
- github.com/Saran-nns/Edge-computing-with-tf (TensorFlow, ★ 0)
- github.com/JoshuaLeland/WaveNetEncoderContinuous (PyTorch, ★ 0)
Abstract
Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are realistic and expressive.
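To make the pipeline in the abstract concrete — mu-law quantization of the raw waveform (standard in WaveNet models), a per-frame temporal code, and linear interpolation between two note embeddings for timbre morphing — here is a minimal NumPy sketch. This is not the authors' implementation: the toy pooling "encoder" stands in for the paper's dilated-convolution encoder, and the hop size, embedding dimension, and all function names are illustrative assumptions.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand a [-1, 1] waveform and quantize to mu+1 levels (WaveNet-style)."""
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((companded + 1) / 2 * mu).astype(np.int64)

def mu_law_decode(q, mu=255):
    """Invert mu_law_encode up to quantization error."""
    companded = 2 * q.astype(np.float64) / mu - 1
    return np.sign(companded) * np.expm1(np.abs(companded) * np.log1p(mu)) / mu

def encode_temporal(wave, hop=512, dim=16, rng=None):
    """Toy encoder: pool the waveform into frames and project each frame to an
    embedding. A stand-in for the learned dilated-conv encoder; weights are
    random here, so the codes are not meaningful, only shape-correct."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_frames = len(wave) // hop
    frames = wave[: n_frames * hop].reshape(n_frames, hop)
    proj = rng.standard_normal((hop, dim)) / np.sqrt(hop)
    return frames @ proj  # (n_frames, dim) temporal code

def morph(z_a, z_b, alpha):
    """Linear interpolation in embedding space, the operation behind
    the instrument-morphing results described in the abstract."""
    return (1 - alpha) * z_a + alpha * z_b

# One second of a 440 Hz sine at 16 kHz as a stand-in "musical note".
wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
q = mu_law_encode(wave)          # int codes in [0, 255]
z = encode_temporal(wave)        # (31, 16) temporal embedding
```

In the actual model the decoder is an autoregressive WaveNet that predicts each mu-law sample conditioned on the upsampled temporal code; the sketch above only covers the representation side.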