Audiovisual Masked Autoencoders

2022-12-09 · ICCV 2023 · Code Available

Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab

Abstract

Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
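The pretraining scheme described above follows the masked-autoencoding recipe: patchify both modalities, hide most patches from the encoder, and train a decoder to reconstruct the hidden patches. The following is a minimal numpy sketch of that data flow under one of the fusion designs the paper studies (early fusion of video and audio tokens); the single linear maps stand in for the transformer encoder and decoder, and all shapes, names, and the stand-in mask token are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def masked_autoencode(video_tokens, audio_tokens, mask_ratio, rng):
    """One toy forward pass of an audiovisual masked autoencoder.

    Illustrates the data flow only: mask each token, encode the
    visible tokens of both modalities jointly (early fusion), decode
    with a shared mask token at hidden positions, and score the
    reconstruction only on the masked patches (MAE-style loss).
    Linear maps replace the transformer blocks of the real model.
    """
    d = video_tokens.shape[1]
    # Early fusion: concatenate video and audio patch embeddings.
    tokens = np.concatenate([video_tokens, audio_tokens], axis=0)
    n = tokens.shape[0]

    # Random per-token mask (True = hidden from the encoder).
    mask = rng.random(n) < mask_ratio
    visible = tokens[~mask]

    # Stand-in "encoder": one linear map over visible tokens only.
    W_enc = rng.normal(size=(d, d)) / np.sqrt(d)
    latent = visible @ W_enc

    # Decoder input: encoded latents at visible slots, a shared
    # stand-in "learned" mask token at every masked slot.
    dec_in = np.zeros_like(tokens)
    dec_in[~mask] = latent
    dec_in[mask] = rng.normal(size=d) / np.sqrt(d)

    # Stand-in "decoder" predicts the original patch embeddings.
    W_dec = rng.normal(size=(d, d)) / np.sqrt(d)
    recon = dec_in @ W_dec

    # Loss is computed on masked patches only, as in MAE.
    loss = float(np.mean((recon[mask] - tokens[mask]) ** 2))
    return recon, mask, loss

rng = np.random.default_rng(0)
video = rng.normal(size=(16, 8))   # e.g. 16 video patch embeddings
audio = rng.normal(size=(8, 8))    # e.g. 8 audio-spectrogram patches
recon, mask, loss = masked_autoencode(video, audio, mask_ratio=0.75, rng=rng)
```

Because only the masked positions contribute to the loss, the encoder sees roughly a quarter of the tokens at the 0.75 masking ratio, which is what makes this style of pretraining cheap relative to processing full clips.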

Benchmark Results

Dataset           | Model                                                  | Metric         | Claimed | Verified | Status
AudioSet          | Audiovisual Masked Autoencoder (Audiovisual, Single)   | Test mAP       | 0.52    |          | Unverified
AudioSet          | Audiovisual Masked Autoencoder (Audio-only, Single)    | Test mAP       | 0.47    |          | Unverified
EPIC-KITCHENS-100 | Audiovisual Masked Autoencoder (Audio-only, Single)    | Top-1 Action   | 19.7    |          | Unverified
EPIC-KITCHENS-100 | Audiovisual Masked Autoencoder (Audiovisual, Single)   | Top-1 Action   | 46      |          | Unverified
EPIC-KITCHENS-100 | Audiovisual Masked Autoencoder (Video-only, Single)    | Top-1 Action   | 45.8    |          | Unverified
VGGSound          | Audiovisual Masked Autoencoder (Audio-only, Single)    | Top-1 Accuracy | 57.2    |          | Unverified
VGGSound          | Audiovisual Masked Autoencoder (Audiovisual, Single)   | Top-1 Accuracy | 65      |          | Unverified