UAVM: Towards Unifying Audio and Visual Models

2022-07-29 · Code available

Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass

Abstract

Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. More interestingly, we also find a few intriguing properties of UAVM that the modality-independent counterparts do not have.

Benchmark Results

Dataset   Model                 Metric          Claimed  Verified  Status
AudioSet  UAVM (Audio + Video)  Test mAP        0.5      -         Unverified
VGGSound  UAVM (Audio + Video)  Top 1 Accuracy  65.8     -         Unverified
VGGSound  UAVM (Audio Only)     Top 1 Accuracy  56.5     -         Unverified
VGGSound  UAVM (Video Only)     Top 1 Accuracy  49.9     -         Unverified