Inter-Speaker Relative Cues for Text-Guided Target Speech Extraction

2025-06-02

Wang Dai, Archontis Politis, Tuomas Virtanen


Abstract

We propose a novel approach that utilizes inter-speaker relative cues to distinguish target speakers and extract their voices from mixtures. Continuous cues (e.g., temporal order, age, pitch level) are grouped by relative differences, while discrete cues (e.g., language, gender, emotion) retain their categorical distinctions. Compared to fixed speech attribute classification, inter-speaker relative cues offer greater flexibility, facilitating much easier expansion of text-guided target speech extraction datasets. Our experiments show that combining all relative cues yields better performance than random subsets, with gender and temporal order being the most robust across languages and reverberant conditions. Additional cues, such as pitch level, loudness, distance, speaking duration, language, and pitch range, also demonstrate notable benefits in complex scenarios. Fine-tuning pre-trained WavLM Base+ CNN encoders improves overall performance over the Conv1d baseline.
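To make the cue construction concrete, below is a minimal sketch (not the authors' implementation) of how inter-speaker relative cues could be turned into a text description of the target speaker in a two-speaker mixture. Attribute names, units, and phrasings here are illustrative assumptions: continuous attributes are reduced to a relative comparison between the two speakers, while discrete attributes keep their categorical label only when it distinguishes the speakers.

```python
# Sketch of relative-cue text generation for a two-speaker mixture.
# All attribute names and phrasings are hypothetical examples, not from the paper.

def relative_cue_text(target: dict, interferer: dict) -> str:
    """Describe the target speaker relative to the interfering speaker."""
    cues = []

    # Continuous cues: only the relative difference is kept, not the absolute value.
    continuous = {
        "pitch_hz": ("lower pitch", "higher pitch"),
        "loudness_db": ("quieter", "louder"),
        "onset_s": ("speaks first", "speaks later"),
        "distance_m": ("closer to the microphone", "farther from the microphone"),
    }
    for attr, (if_lower, if_higher) in continuous.items():
        if attr in target and attr in interferer and target[attr] != interferer[attr]:
            cues.append(if_lower if target[attr] < interferer[attr] else if_higher)

    # Discrete cues: keep the categorical label, but only when it actually
    # separates the two speakers.
    for attr in ("gender", "language", "emotion"):
        if attr in target and attr in interferer and target[attr] != interferer[attr]:
            cues.append(str(target[attr]))

    return "; ".join(cues) if cues else "target speaker"


if __name__ == "__main__":
    target = {"pitch_hz": 210.0, "loudness_db": -23.0, "onset_s": 0.4,
              "gender": "female", "language": "English"}
    interferer = {"pitch_hz": 120.0, "loudness_db": -18.0, "onset_s": 0.0,
                  "gender": "male", "language": "English"}
    # Prints e.g. "higher pitch; quieter; speaks later; female"
    print(relative_cue_text(target, interferer))
```

In a text-guided extraction pipeline, a string like this would be fed to the text encoder as the conditioning prompt; the exact templating and encoder interface depend on the system and are not specified here.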
