UAVM: Towards Unifying Audio and Visual Models

2022-07-29

Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass

Code

github.com/YuanGongND/uavm
Official · In paper · PyTorch · ★ 57

Abstract

Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. More interestingly, we also find a few intriguing properties of UAVM that the modality-independent counterparts do not have.

Tasks

Audio Classification, audio-visual learning, Multi-modal Classification

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
AudioSet	UAVM (Audio + Video)	Test mAP	0.5	—	Unverified
VGGSound	UAVM (Audio + Video)	Top 1 Accuracy	65.8	—	Unverified
VGGSound	UAVM (Audio Only)	Top 1 Accuracy	56.5	—	Unverified
VGGSound	UAVM (Video Only)	Top 1 Accuracy	49.9	—	Unverified
