TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context
Nithin Rao Koluguri, Taejin Park, Boris Ginsburg
Code
- github.com/NVIDIA/NeMo (official, in paper, PyTorch, ★ 16,967)
- github.com/Wadaboa/titanet (PyTorch, ★ 68)
Abstract
In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker representations. We employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer, to map variable-length utterances to a fixed-length embedding (t-vector). TitaNet is a scalable architecture and achieves state-of-the-art performance on the speaker verification task, with an equal error rate (EER) of 0.68% on the VoxCeleb1 trial file, and on speaker diarization tasks, with diarization error rates (DER) of 1.73% on AMI-MixHeadset, 1.99% on AMI-Lapel, and 1.11% on CH109. Furthermore, we investigate various sizes of TitaNet and present a light TitaNet-S model with only 6M parameters that achieves near state-of-the-art results in diarization tasks.
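To make the pooling step in the abstract concrete, the following is a minimal NumPy sketch of channel-wise attentive statistics pooling, which turns a feature map with any number of frames into a fixed-length vector. It is an illustration under stated assumptions, not the TitaNet implementation: the tanh-based scoring, the weight names `w_hidden` and `w_out`, the layer sizes, and the random weights are all hypothetical, and the actual layer in NeMo may differ.

```python
import numpy as np


def attentive_stats_pooling(features, w_hidden, w_out):
    """Pool a (channels, frames) feature map into a (2 * channels,) vector.

    Assumed scoring scheme (not taken from the paper): a small tanh
    bottleneck produces one attention score per channel and frame.
    """
    hidden = np.tanh(w_hidden @ features)            # (hidden, frames)
    scores = w_out @ hidden                          # (channels, frames)

    # Softmax over the time axis, computed stably per channel.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Attention-weighted mean and standard deviation per channel.
    mean = (weights * features).sum(axis=1)
    var = (weights * features ** 2).sum(axis=1) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-8, None))

    return np.concatenate([mean, std])


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
channels, hidden_dim = 8, 4
w_hidden = rng.standard_normal((hidden_dim, channels))
w_out = rng.standard_normal((channels, hidden_dim))

short_utt = rng.standard_normal((channels, 50))      # 50 frames
long_utt = rng.standard_normal((channels, 300))      # 300 frames

short_vec = attentive_stats_pooling(short_utt, w_hidden, w_out)
long_vec = attentive_stats_pooling(long_utt, w_hidden, w_out)
print(short_vec.shape, long_vec.shape)
```

Both calls return a vector of length `2 * channels` regardless of the number of input frames. In a full model, a linear layer would typically project this vector to the final embedding size.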
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| AMI Lapel | TitaNet-M (NME-SC) | DER(%) | 1.99 | — | Unverified |
| AMI Lapel | TitaNet-S (NME-SC) | DER(%) | 2.00 | — | Unverified |
| AMI Lapel | TitaNet-L (NME-SC) | DER(%) | 2.03 | — | Unverified |
| AMI Lapel | ECAPA (SC) | DER(%) | 2.36 | — | Unverified |
| AMI MixHeadset | TitaNet-S (NME-SC) | DER(%) | 2.22 | — | Unverified |
| AMI MixHeadset | TitaNet-M (NME-SC) | DER(%) | 1.79 | — | Unverified |
| AMI MixHeadset | ECAPA (SC) | DER(%) | 1.78 | — | Unverified |
| AMI MixHeadset | TitaNet-L (NME-SC) | DER(%) | 1.73 | — | Unverified |
| CALLHOME-109 | titanet-s | DER(%) | 1.11 | — | Unverified |
| CH109 | TitaNet-M (NME-SC) | DER(%) | 1.13 | — | Unverified |
| CH109 | TitaNet-L (NME-SC) | DER(%) | 1.19 | — | Unverified |
| CH109 | x-vector (PLDA + AHC) | DER(%) | 9.72 | — | Unverified |
| CH109 | TitaNet-S (NME-SC) | DER(%) | 1.11 | — | Unverified |
| NIST-SRE 2000 | x-vector (MCGAN) | DER(%) | 5.73 | — | Unverified |
| NIST-SRE 2000 | x-vector (PLDA + AHC) | DER(%) | 8.39 | — | Unverified |
| NIST-SRE 2000 | TitaNet-L (NME-SC) | DER(%) | 6.73 | — | Unverified |
| NIST-SRE 2000 | TitaNet-M (NME-SC) | DER(%) | 6.47 | — | Unverified |
| NIST-SRE 2000 | TitaNet-S (NME-SC) | DER(%) | 6.37 | — | Unverified |
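
For readers unfamiliar with the metric in the table, DER is conventionally defined as the sum of false-alarm, missed-speech, and speaker-confusion time divided by the total reference speech time. The sketch below computes it from durations in seconds. The function name and the example durations are made up for illustration and are not taken from the paper; the collar and overlap settings behind the reported numbers are not stated here.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """Return DER in percent from error durations (all in seconds)."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return 100.0 * (false_alarm + missed + confusion) / total_speech


# Made-up durations: 6 seconds of errors over 600 seconds of reference speech.
example_der = diarization_error_rate(
    false_alarm=1.0, missed=2.0, confusion=3.0, total_speech=600.0
)
print(f"DER = {example_der:.2f}%")  # DER = 1.00%
```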