End-to-end streaming model for low-latency speech anonymization

2024-06-13

Waris Quamer, Ricardo Gutierrez-Osuna

Abstract

Speaker anonymization aims to conceal cues to speaker identity while preserving linguistic content. Current machine-learning-based approaches require substantial computational resources, hindering real-time streaming applications. To address these concerns, we propose a streaming model that achieves speaker anonymization with low latency. The system is trained in an end-to-end autoencoder fashion using a lightweight content encoder that extracts HuBERT-like information, a pretrained speaker encoder that extracts speaker identity, and a variance encoder that injects pitch and energy information. These three disentangled representations are fed to a decoder that re-synthesizes the speech signal. We present evaluation results from two implementations of our system: a full model that achieves a latency of 230 ms, and a lite version (0.1x in size) that further reduces latency to 66 ms while maintaining state-of-the-art performance in naturalness, intelligibility, and privacy preservation.
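
The data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy encoders, and the choice to concatenate the three representations are all assumptions made for illustration, and the sketch further assumes that at inference time a fixed embedding of a different (target) speaker replaces the source speaker's embedding.

```python
from typing import Callable, Iterable, Iterator, List

Vector = List[float]


def energy(chunk: Vector) -> float:
    """Mean squared amplitude of a chunk (toy stand-in for an energy feature)."""
    return sum(x * x for x in chunk) / len(chunk) if chunk else 0.0


def anonymize_stream(
    chunks: Iterable[Vector],
    content_encoder: Callable[[Vector], Vector],
    variance_encoder: Callable[[Vector], Vector],
    speaker_embedding: Vector,  # assumed: embedding of a target speaker
    decoder: Callable[[Vector], Vector],
) -> Iterator[Vector]:
    """Process audio chunk by chunk, emitting output as soon as each chunk is decoded."""
    for chunk in chunks:
        content = content_encoder(chunk)    # linguistic content
        variance = variance_encoder(chunk)  # pitch/energy information
        # Hypothetical fusion: concatenate the three representations.
        yield decoder(content + speaker_embedding + variance)


# Toy placeholders standing in for trained networks (purely illustrative).
def toy_content_encoder(chunk: Vector) -> Vector:
    return [sum(chunk) / len(chunk)]


def toy_variance_encoder(chunk: Vector) -> Vector:
    return [energy(chunk)]


def toy_decoder(features: Vector) -> Vector:
    return list(features)


outputs = list(
    anonymize_stream(
        [[1.0, -1.0], [2.0, 2.0]],
        toy_content_encoder,
        toy_variance_encoder,
        [0.5, 0.5],
        toy_decoder,
    )
)
```

Because the generator yields one output per input chunk, latency in such a design is bounded by the chunk duration plus per-chunk compute, rather than by the length of the whole utterance.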

Tasks

Decoder, Speaker anonymization
