UAVM: Towards Unifying Audio and Visual Models
2022-07-29
Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass
Code
- github.com/YuanGongND/uavm (official, in paper, PyTorch, ★ 57)
Abstract
Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. More interestingly, we also find a few intriguing properties of UAVM that the modality-independent counterparts do not have.
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| AudioSet | UAVM (Audio + Video) | Test mAP | 0.5 | — | Unverified |
| VGGSound | UAVM (Audio + Video) | Top 1 Accuracy | 65.8 | — | Unverified |
| VGGSound | UAVM (Audio Only) | Top 1 Accuracy | 56.5 | — | Unverified |
| VGGSound | UAVM (Video Only) | Top 1 Accuracy | 49.9 | — | Unverified |
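
For readers who want to check numbers of this kind, the two metrics in the table can be sketched as follows. This is a minimal illustration of the standard definitions of top-1 accuracy and mean average precision, not the evaluation code from the UAVM repository; the function names and the toy inputs are assumptions made for the example. Depending on tie handling and interpolation, library implementations of average precision may differ slightly from this version.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Percentage of samples whose highest-scoring class equals the true label."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score in this row (first one wins on ties).
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


def average_precision(scores: Sequence[float], targets: Sequence[int]) -> float:
    """Average precision for one class: mean of precision at each positive's rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if targets[i]:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("average precision is undefined without positive samples")
    return precision_sum / hits


def mean_average_precision(
    scores: Sequence[Sequence[float]], targets: Sequence[Sequence[int]]
) -> float:
    """Mean of per-class average precision for multi-label predictions."""
    num_classes = len(scores[0])
    per_class = [
        average_precision([row[c] for row in scores], [row[c] for row in targets])
        for c in range(num_classes)
    ]
    return sum(per_class) / num_classes


if __name__ == "__main__":
    # Toy single-label data (hypothetical): 4 clips, 2 classes.
    clip_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
    clip_labels = [1, 0, 0, 0]
    print(top1_accuracy(clip_scores, clip_labels))  # 75.0
```

Top-1 accuracy is returned as a percentage, matching the VGGSound rows, while mean average precision is returned as a fraction between 0 and 1, matching the AudioSet row.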