TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context

2021-10-08 · Code Available

Nithin Rao Koluguri, Taejin Park, Boris Ginsburg

Abstract

In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker representations. We employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer, to map variable-length utterances to a fixed-length embedding (t-vector). TitaNet is a scalable architecture and achieves state-of-the-art performance on the speaker verification task, with an equal error rate (EER) of 0.68% on the VoxCeleb1 trial file, and on speaker diarization tasks, with a diarization error rate (DER) of 1.73% on AMI-MixHeadset, 1.99% on AMI-Lapel, and 1.11% on CH109. Furthermore, we investigate various sizes of TitaNet and present a light TitaNet-S model with only 6M parameters that achieves near state-of-the-art results in diarization tasks.
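The pooling step is what turns an utterance of any length into a fixed-size vector. The sketch below illustrates the general idea of attention-weighted statistics pooling in plain Python; it is not the authors' implementation. The function name `attentive_stats_pool` and the `scores` input are hypothetical. In a trained model the scores would presumably come from a learned attention layer, and the pooled statistics would likely be projected further before being used as an embedding.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pool(features, scores):
    """Pool C channels of T frames each into a vector of length 2 * C.

    features, scores: lists of C channels, each a list of T floats.
    `scores` stands in for the output of a learned attention layer
    (an assumption made for this illustration).
    Returns the attention-weighted means followed by the weighted
    standard deviations, one of each per channel.
    """
    means, stds = [], []
    for chan_feats, chan_scores in zip(features, scores):
        weights = softmax(chan_scores)  # attention over time, per channel
        mean = sum(w * x for w, x in zip(weights, chan_feats))
        var = sum(w * (x - mean) ** 2 for w, x in zip(weights, chan_feats))
        means.append(mean)
        stds.append(math.sqrt(max(var, 0.0)))
    return means + stds


# Two "utterances" with 2 channels but different numbers of frames.
short = attentive_stats_pool(
    [[1.0, 3.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0]],
)
long = attentive_stats_pool(
    [[1.0, 2.0, 3.0, 2.0, 2.0], [0.5] * 5],
    [[0.0] * 5, [1.0] * 5],
)
print(len(short), len(long))  # both outputs have length 2 * C
```

Both calls return a vector of length 4 even though the inputs have 2 and 5 frames, which is the property that lets a fixed-length embedding be compared across utterances of different durations.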

Benchmark Results

Dataset         Model                    Metric   Claimed  Verified  Status
AMI Lapel       TitaNet-M (NME-SC)       DER(%)   1.99     -         Unverified
AMI Lapel       TitaNet-S (NME-SC)       DER(%)   2        -         Unverified
AMI Lapel       TitaNet-L (NME-SC)       DER(%)   2.03     -         Unverified
AMI Lapel       ECAPA (SC)               DER(%)   2.36     -         Unverified
AMI MixHeadset  TitaNet-S (NME-SC)       DER(%)   2.22     -         Unverified
AMI MixHeadset  TitaNet-M (NME-SC)       DER(%)   1.79     -         Unverified
AMI MixHeadset  ECAPA (SC)               DER(%)   1.78     -         Unverified
AMI MixHeadset  TitaNet-L (NME-SC)       DER(%)   1.73     -         Unverified
CALLHOME-109    titanet-s                DER(%)   1.11     -         Unverified
CH109           TitaNet-M (NME-SC)       DER(%)   1.13     -         Unverified
CH109           TitaNet-L (NME-SC)       DER(%)   1.19     -         Unverified
CH109           x-vector (PLDA + AHC)    DER(%)   9.72     -         Unverified
CH109           TitaNet-S (NME-SC)       DER(%)   1.11     -         Unverified
NIST-SRE 2000   x-vector (MCGAN)         DER(%)   5.73     -         Unverified
NIST-SRE 2000   x-vector (PLDA + AHC)    DER(%)   8.39     -         Unverified
NIST-SRE 2000   TitaNet-L (NME-SC)       DER(%)   6.73     -         Unverified
NIST-SRE 2000   TitaNet-M (NME-SC)       DER(%)   6.47     -         Unverified
NIST-SRE 2000   TitaNet-S (NME-SC)       DER(%)   6.37     -         Unverified

Reproductions