VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

2023-03-29 · CVPR 2023 · Code Available

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao

Code

github.com/OpenGVLab/VideoMAEv2
Official · In paper · PyTorch · ★ 764

Abstract

Scale is the primary factor for building a powerful foundation model that can generalize well to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is very efficient due to the high masking ratio in the encoder, masking the decoder can further reduce the overall computational cost. This enables the efficient pre-training of billion-level models in video. We also use a progressive training paradigm that involves an initial pre-training on a diverse multi-sourced unlabeled dataset, followed by a post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on the Kinetics datasets (90.0% on K400 and 89.9% on K600) and the Something-Something datasets (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as general video representation learners. The code and models are available at https://github.com/OpenGVLab/VideoMAEv2.
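
As a rough illustration of the dual masking idea described in the abstract, the sketch below splits a set of video token indices into a small subset seen by the encoder and a second subset used as decoder reconstruction targets. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `dual_mask`, the token count, and both mask ratios are hypothetical values chosen for illustration, and uniform random sampling stands in for whatever masking patterns the paper actually uses.

```python
import random


def dual_mask(num_tokens, encoder_mask_ratio, decoder_mask_ratio, seed=0):
    """Pick encoder-visible tokens and decoder target tokens (illustrative only).

    Uniform random sampling is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)

    # The encoder only operates on a small visible subset of all tokens.
    num_visible = round(num_tokens * (1 - encoder_mask_ratio))
    visible = sorted(indices[:num_visible])
    hidden = indices[num_visible:]

    # The decoder is also masked: only part of the tokens hidden from the
    # encoder are kept as reconstruction targets, which shortens its input.
    num_targets = round(len(hidden) * (1 - decoder_mask_ratio))
    targets = sorted(hidden[:num_targets])
    return visible, targets


# Hypothetical setting: 1568 tokens, 90% encoder masking, 50% decoder masking.
visible, targets = dual_mask(1568, encoder_mask_ratio=0.9, decoder_mask_ratio=0.5)
print(len(visible), len(targets))
```

With these assumed ratios the encoder sees roughly a tenth of the tokens, and the decoder reconstructs about half of the remainder instead of all of it, which is where the additional compute saving would come from.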

Tasks

Action Classification, Action Recognition, Action Recognition In Videos, Decoder, Self-Supervised Action Recognition, Spatio-Temporal Action Localization, Temporal Action Localization

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
AVA v2.2	VideoMAE V2-g	mAP	42.6	—	Unverified
HMDB-51	VideoMAE V2-g	Average accuracy of 3 splits	88.7	—	Unverified
Something-Something V1	VideoMAE V2-g	Top-1 Accuracy	68.7	—	Unverified
Something-Something V2	VideoMAE V2-g	Top-1 Accuracy	77.0	—	Unverified
UCF101	VideoMAE V2-g	3-fold Accuracy	99.6	—	Unverified
