Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

2023-06-01

Hubert Siuzdak

Code

github.com/gemelo-ai/vocos
Official · In paper · pytorch · ★ 1,087
github.com/collabora/whisperspeech
pytorch · ★ 4,576
github.com/whisperspeech/whisperspeech
pytorch · ★ 4,576
github.com/IAHispano/Applio/tree/exp/vocoders/rvc/lib/algorithm
pytorch · ★ 0

Abstract

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception and benefiting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state of the art in audio quality, as demonstrated in our evaluations, but also substantially improves computational efficiency, achieving an order-of-magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.
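To make the phase recovery issue mentioned in the abstract concrete, the following is a minimal sketch, not the Vocos implementation. It assumes a single toy frame and a naive DFT written with the Python standard library; the helper names (`dft`, `idft`, `from_polar`) are hypothetical and chosen for illustration only. It shows that complex Fourier coefficients built from both magnitude and phase invert back to the waveform, while magnitudes alone do not.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(coeffs):
    """Naive inverse DFT; returns the real part of each time-domain sample."""
    n = len(coeffs)
    return [
        (sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def from_polar(magnitudes, phases):
    """Combine magnitude and phase into complex spectral coefficients."""
    return [cmath.rect(m, p) for m, p in zip(magnitudes, phases)]


# Toy frame (an assumption for illustration): one sinusoid period plus a DC offset.
frame = [0.5 + math.sin(2 * math.pi * t / 8) for t in range(8)]
spectrum = dft(frame)
magnitudes = [abs(c) for c in spectrum]
phases = [cmath.phase(c) for c in spectrum]

# With both magnitude and phase, the inverse transform recovers the frame.
restored = idft(from_polar(magnitudes, phases))
max_error = max(abs(a - b) for a, b in zip(frame, restored))

# With the phase discarded (set to zero), the waveform is not recovered.
zero_phase = idft(from_polar(magnitudes, [0.0] * len(magnitudes)))
zero_phase_error = max(abs(a - b) for a, b in zip(frame, zero_phase))
```

In this toy setting `max_error` is at floating-point noise level, while `zero_phase_error` is large. This suggests why a model that outputs spectral coefficients has to produce usable phase as well as magnitude. A practical system would use a fast, windowed inverse STFT over many overlapping frames rather than this single-frame naive transform.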

Tasks

Audio Synthesis Computational Efficiency Inductive Bias Speech Synthesis

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
LibriTTS	Vocos	PESQ	3.7	—	Unverified
