Wanna hear your voice? A sample is all we need!

2024-10-01

The Hieu Pham, Phuong Thanh Tran Nguyen, Xuan Tho Nguyen, Tan Dat Nguyen, Duc Dung Nguyen

Abstract

Research on audio clue-based target speaker extraction (TSE) has focused on modeling mixtures and reference speech, achieving strong results in English due to abundant datasets. However, cross-lingual properties remain underexplored, as low-resource languages face challenges from limited annotated data and linguistic resources. To bridge this gap, we propose WHYV (Wanna Hear Your Voice), a cross-lingual TSE framework enabling zero-shot adaptation without fine-tuning. WHYV employs a frequency-modulated gating mechanism that dynamically adjusts the acoustic features of the target speaker, minimizing reliance on language-specific cues. Evaluations demonstrate state-of-the-art zero-shot performance: 13.8 dB (Libri2Mix mix-both), 18.1 dB (mix-clean), and 14.8 dB on Vietnamese data.

Tasks

Speech Separation, Target Speaker Extraction

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
Libri2Mix	WHYV	SI-SDRi	17.5	—	Unverified
WHAM!	WHYV	SI-SDRi	12.96	—	Unverified
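
The SI-SDRi metric in the table is the scale-invariant signal-to-distortion ratio improvement: the SI-SDR of the extracted speech minus the SI-SDR of the unprocessed mixture, both measured against the clean target and reported in dB. The sketch below shows the standard definition in plain Python. It is an illustration only, not the paper's evaluation code; the function names `si_sdr` and `si_sdr_improvement` and the `eps` stabilizer are assumptions made for this example.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length 1-D signals."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")

    # Remove the mean so a DC offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target to find the optimal scaling.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t)
    alpha = dot / (target_energy + eps)

    # Split the estimate into a scaled-target part and a residual.
    projection = [alpha * b for b in t]
    residual = [a - p for a, p in zip(e, projection)]

    signal_energy = sum(p * p for p in projection)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: gain in SI-SDR of the estimate over the raw mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


if __name__ == "__main__":
    # Toy signals: a target tone, an interfering tone, and their mixture.
    n = 200
    target = [math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
    interferer = [math.sin(2 * math.pi * 13 * k / n) for k in range(n)]
    mixture = [a + b for a, b in zip(target, interferer)]
    # A hypothetical extractor that suppresses the interferer by a factor of 10.
    estimate = [a + 0.1 * b for a, b in zip(target, interferer)]
    print(f"SI-SDRi: {si_sdr_improvement(estimate, mixture, target):.2f} dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric is preferred over plain SDR when comparing extraction systems with different output gains.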
