Multi-Task Learning for Audio-Visual Active Speaker Detection

2019-06-01 · The ActivityNet Large-Scale Activity Recognition Challenge Workshop, CVPR 2019

Yuanhang Zhang, Jingyun Xiao, Shuang Yang, Shiguang Shan


Abstract

This report describes the approach underlying our submission to the active speaker detection task (task B-2) of the ActivityNet Challenge 2019. We introduce a new audio-visual model which builds upon a 3D-ResNet18 visual model pretrained for lipreading and a VGG-M acoustic model pretrained for audio-to-video synchronization. The model is trained with two losses in a multi-task learning fashion: a contrastive loss that enforces matching between audio and video features for active speakers, and a regular cross-entropy loss that produces speaker/non-speaker labels. This model obtains 84.0% mAP on the validation set of AVA-ActiveSpeaker. Experimental results demonstrate the pretrained embeddings' ability to transfer across tasks and data formats, as well as the advantage of the proposed multi-task learning strategy.
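
To make the two-loss objective described in the abstract concrete, here is a minimal PyTorch sketch of combining a contrastive audio-video matching loss with a cross-entropy speaker/non-speaker loss. This is illustrative only, not the authors' implementation: the class and function names, feature dimensions, margin, and loss weight alpha are assumptions, and the linear projections stand in for the pretrained 3D-ResNet18 (video) and VGG-M (audio) backbones.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskASD(nn.Module):
    """Sketch of an audio-visual active speaker detector trained with a
    contrastive loss (audio-video matching) plus a cross-entropy loss
    (speaker vs. non-speaker). The projections below are placeholders
    for the paper's pretrained 3D-ResNet18 and VGG-M encoders."""

    def __init__(self, feat_dim=512, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(feat_dim, embed_dim)  # stand-in for 3D-ResNet18 head
        self.audio_proj = nn.Linear(feat_dim, embed_dim)  # stand-in for VGG-M head
        # Speaker / non-speaker classifier over the fused embedding.
        self.classifier = nn.Linear(2 * embed_dim, 2)

    def forward(self, video_feat, audio_feat):
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        a = F.normalize(self.audio_proj(audio_feat), dim=-1)
        logits = self.classifier(torch.cat([v, a], dim=-1))
        return v, a, logits

def multi_task_loss(v, a, logits, labels, margin=0.5, alpha=1.0):
    """Contrastive term pulls audio and video embeddings together when
    the face is speaking (label 1) and pushes them at least `margin`
    apart otherwise; cross-entropy supervises the classification head.
    `margin` and the weighting `alpha` are illustrative values."""
    dist_sq = (v - a).pow(2).sum(dim=-1)               # squared L2 distance per pair
    pos = labels.float() * dist_sq                     # matched pairs: minimize distance
    neg = (1 - labels.float()) * F.relu(
        margin - dist_sq.clamp(min=1e-12).sqrt()).pow(2)
    contrastive = (pos + neg).mean()
    ce = F.cross_entropy(logits, labels)
    return ce + alpha * contrastive

# Usage with dummy features in place of real encoder outputs:
model = MultiTaskASD()
video = torch.randn(8, 512)                # per-face video features
audio = torch.randn(8, 512)                # per-clip audio features
labels = torch.randint(0, 2, (8,))         # 1 = active speaker, 0 = not
v, a, logits = model(video, audio)
loss = multi_task_loss(v, a, logits, labels)
loss.backward()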
