AVA-AVD: Audio-Visual Speaker Diarization in the Wild
Eric Zhongcong Xu, Zeyang Song, Satoshi Tsutsui, Chao Feng, Mang Ye, Mike Zheng Shou
Code
- github.com/zcxu-eric/ava-avd (official, in paper; PyTorch; ★ 50)
- github.com/showlab/ava-avd (official, in paper; PyTorch; ★ 21)
- github.com/pyannote/pyannote-audio (PyTorch; ★ 9,388)
- github.com/frenchkrab/is2023-powerset-diarization (framework not listed; ★ 93)
- github.com/MindSpore-paper-code-3/code1/tree/main/AVA_hpa (MindSpore; ★ 0)
- github.com/pwc-1/Paper-9/tree/main/6/AVA_cifar/src/RandAugment (MindSpore; ★ 0)
- github.com/MindSpore-paper-code-3/code6/tree/main/AVA_hpa (MindSpore; ★ 0)
- github.com/MindSpore-scientific/code-12/tree/main/AVA_cifar (MindSpore; ★ 0)
- github.com/2023-MindSpore-1/ms-code-17/tree/main/AVA_hpa (MindSpore; ★ 0)
Abstract
Audio-visual speaker diarization aims to detect "who spoke when" using both auditory and visual signals. Existing audio-visual diarization datasets mainly focus on indoor environments such as meeting rooms or news studios, which differ considerably from in-the-wild videos such as movies, documentaries, and audience sitcoms. To develop diarization methods for these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD) dataset. Our experiments demonstrate that adding AVA-AVD to the training set produces significantly better diarization models for in-the-wild videos, even though the dataset is relatively small. Moreover, this benchmark is challenging due to its diverse scenes, complicated acoustic conditions, and completely off-screen speakers. As a first step towards addressing these challenges, we design the Audio-Visual Relation Network (AVR-Net), which introduces a simple yet effective modality mask to capture discriminative information based on face visibility. Experiments show that our method not only outperforms state-of-the-art methods but is also more robust as the ratio of off-screen speakers varies. Our data and code have been made publicly available at https://github.com/showlab/AVA-AVD.
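
The abstract does not describe how the modality mask is built, so the following is only an illustrative sketch of the general idea, not AVR-Net itself. It assumes (hypothetically) that an off-screen speaker's missing face embedding is replaced with zeros and that a single visibility flag is appended to the fused feature so that a downstream model can tell "no face" apart from "a face whose features happen to be near zero". The function name `fuse_with_modality_mask` and the feature layout are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def fuse_with_modality_mask(
    audio_emb: Sequence[float],
    face_emb: Optional[Sequence[float]],
    face_dim: int,
) -> List[float]:
    """Concatenate audio features, face features, and a visibility flag.

    Hypothetical layout (an assumption, not taken from the paper):
        [audio features | face features or zeros | mask]
    where mask is 1.0 if a face is visible and 0.0 if the speaker is off-screen.
    """
    visible = face_emb is not None
    if visible and len(face_emb) != face_dim:
        raise ValueError("face_emb length must equal face_dim")

    # Off-screen speaker: fill the face slot with zeros so the length is fixed.
    face_part = list(face_emb) if visible else [0.0] * face_dim
    mask = [1.0 if visible else 0.0]
    return list(audio_emb) + face_part + mask


# On-screen speaker: face features are kept and the mask is 1.0.
on_screen = fuse_with_modality_mask([0.1, 0.2], [0.5, 0.6, 0.7], face_dim=3)

# Off-screen speaker: face slot is zero-filled and the mask is 0.0.
off_screen = fuse_with_modality_mask([0.1, 0.2], None, face_dim=3)
```

Keeping the fused vector at a fixed length regardless of visibility is what lets one model handle both on-screen and off-screen speakers in this sketch; the official repository linked above is the place to check how the authors actually implement the mask.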