VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin Wu, John Hershey, Rif A. Saurous, Ron J. Weiss, Ye Jia, Ignacio Lopez Moreno
Code
- github.com/maum-ai/voicefilter (PyTorch) ★ 1,195
- github.com/Edresson/VoiceSplit (PyTorch) ★ 266
- github.com/kooBH/VFWS (PyTorch) ★ 0
- github.com/HeliosX7/voice-filter (TensorFlow) ★ 0
- github.com/jain-abhinav02/VoiceFilter (TensorFlow) ★ 0
Abstract
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) a speaker recognition network that produces speaker-discriminative embeddings; (2) a spectrogram masking network that takes both the noisy spectrogram and the speaker embedding as input and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
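The two-network pipeline in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's architecture: the dimensions, the random-projection "networks", and the mean-pooling of the reference signal are all placeholder assumptions standing in for the trained speaker recognition and masking models. Only the data flow matches the description: a reference signal yields a fixed-size speaker embedding, the embedding conditions a mask prediction over the noisy spectrogram, and element-wise masking yields the enhanced spectrogram.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 257 frequency bins,
# 100 time frames, 256-dim speaker embedding.
N_FREQ, N_FRAMES, EMB_DIM = 257, 100, 256

def speaker_embedding(reference_spectrogram):
    """Stand-in for the speaker recognition network: maps a reference
    spectrogram of any length to a fixed-size, L2-normalized embedding."""
    pooled = reference_spectrogram.mean(axis=1)           # (N_FREQ,)
    W = rng.standard_normal((EMB_DIM, N_FREQ)) * 0.01     # placeholder weights
    e = W @ pooled
    return e / (np.linalg.norm(e) + 1e-8)                 # (EMB_DIM,)

def mask_network(noisy_spectrogram, embedding):
    """Stand-in for the masking network: the speaker embedding is
    concatenated to every time frame of the noisy spectrogram, and a
    soft mask in [0, 1] is predicted for each time-frequency bin."""
    tiled = np.tile(embedding[:, None], (1, noisy_spectrogram.shape[1]))
    frames = np.concatenate([noisy_spectrogram, tiled], axis=0)
    W = rng.standard_normal((N_FREQ, N_FREQ + EMB_DIM)) * 0.01
    return 1.0 / (1.0 + np.exp(-(W @ frames)))            # sigmoid mask

# Separation step: element-wise masking of the noisy magnitude spectrogram.
reference = np.abs(rng.standard_normal((N_FREQ, 50)))     # target speaker's reference
noisy = np.abs(rng.standard_normal((N_FREQ, N_FRAMES)))   # multi-speaker mixture
mask = mask_network(noisy, speaker_embedding(reference))
enhanced = mask * noisy                                   # same shape as the input
```

In the trained system the mask is learned so that `enhanced` approximates the target speaker's clean spectrogram; here the random weights only demonstrate the shapes and the conditioning mechanism.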