VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong, Yibing Song, Jue Wang, LiMin Wang
Code
- github.com/MCG-NJU/VideoMAE (Official, in paper) — PyTorch — ★ 1,700
- github.com/MCG-NJU/VideoMAE-Action-Detection (Official) — PyTorch — ★ 69
- github.com/huggingface/transformers — PyTorch — ★ 158,292
- github.com/innat/VideoMAE — TensorFlow — ★ 22
- github.com/MindSpore-scientific/code-13/tree/main/token_learner — MindSpore — ★ 0
- github.com/MS-P3/code7/tree/main/videomae — MindSpore — ★ 0
- github.com/pwc-1/Paper-9/tree/main/5/videomae — MindSpore — ★ 0
- github.com/MindSpore-scientific/code-7/tree/main/VideoMAE — (framework unspecified) — ★ 0
- github.com/MindCode-4/code-1/tree/main/videomae — MindSpore — ★ 0
Abstract
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). Inspired by the recent ImageMAE, we propose customized video tube masking with an extremely high masking ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging the extraction of more effective video representations during pre-training. We obtain three important findings on SSVP: (1) An extremely high masking ratio (i.e., 90% to 95%) still yields favorable VideoMAE performance. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT achieves 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data. Code is available at https://github.com/MCG-NJU/VideoMAE.
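The abstract's key design is tube masking: the same set of spatial patches is masked in every frame, so temporally redundant content cannot be trivially copied from a neighboring frame. Below is a minimal NumPy sketch of that idea (not the authors' implementation; the function name, shapes, and 90% default ratio are illustrative assumptions based on the abstract).

```python
import numpy as np

def tube_mask(num_frames, num_patches, mask_ratio=0.9, seed=None):
    """Sketch of VideoMAE-style tube masking (illustrative, not the official code).

    Samples one spatial mask and repeats it across all frames, so each masked
    patch forms a "tube" through time. Returns a boolean array of shape
    (num_frames, num_patches) where True means the patch is masked out.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(round(mask_ratio * num_patches))
    # Pick which spatial patches to hide — once, shared by every frame.
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    frame_mask = np.zeros(num_patches, dtype=bool)
    frame_mask[masked_idx] = True
    # Tile the single-frame mask along the temporal axis to form tubes.
    return np.tile(frame_mask, (num_frames, 1))

# Example: 8 temporal slots, 14x14 = 196 spatial patches, 90% masked.
mask = tube_mask(num_frames=8, num_patches=196, mask_ratio=0.9, seed=0)
print(mask.shape)  # (8, 196); every row is identical, ~90% True
```

Because the visible patches are the same in all frames, the encoder only ever sees roughly 10% of the tokens, which is what makes pre-training both harder (good for representation quality) and cheaper (fewer tokens to process).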
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| AVA v2.2 | VideoMAE (K400 pretrain, ViT-B, 16x4) | mAP | 26.7 | — | Unverified |
| AVA v2.2 | VideoMAE (K700 pretrain, ViT-L, 16x4) | mAP | 36.1 | — | Unverified |
| AVA v2.2 | VideoMAE (K400 pretrain, ViT-L, 16x4) | mAP | 34.3 | — | Unverified |
| AVA v2.2 | VideoMAE (K400 pretrain+finetune, ViT-B, 16x4) | mAP | 31.8 | — | Unverified |
| AVA v2.2 | VideoMAE (K400 pretrain+finetune, ViT-H, 16x4) | mAP | 39.5 | — | Unverified |
| AVA v2.2 | VideoMAE (K700 pretrain+finetune, ViT-L, 16x4) | mAP | 39.3 | — | Unverified |
| AVA v2.2 | VideoMAE (K400 pretrain+finetune, ViT-L, 16x4) | mAP | 37.8 | — | Unverified |
| AVA v2.2 | VideoMAE (K400 pretrain, ViT-H, 16x4) | mAP | 36.5 | — | Unverified |
| Something-Something V2 | VideoMAE (no extra data, ViT-L, 16frame) | Top-1 Accuracy | 74.3 | — | Unverified |
| Something-Something V2 | VideoMAE (no extra data, ViT-B, 16frame) | Top-1 Accuracy | 70.8 | — | Unverified |
| Something-Something V2 | VideoMAE (no extra data, ViT-L, 32x2) | Top-1 Accuracy | 75.4 | — | Unverified |