VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong, Yibing Song, Jue Wang, LiMin Wang
Code
- github.com/MCG-NJU/VideoMAE (Official, in paper) — PyTorch — ★ 1,700
- github.com/MCG-NJU/VideoMAE-Action-Detection (Official) — PyTorch — ★ 69
- github.com/huggingface/transformers — PyTorch — ★ 158,292
- github.com/innat/VideoMAE — TensorFlow — ★ 22
- github.com/MindSpore-scientific/code-13/tree/main/token_learner — MindSpore — ★ 0
- github.com/MS-P3/code7/tree/main/videomae — MindSpore — ★ 0
- github.com/pwc-1/Paper-9/tree/main/5/videomae — MindSpore — ★ 0
- github.com/MindSpore-scientific/code-7/tree/main/VideoMAE — (framework unspecified) — ★ 0
- github.com/MindCode-4/code-1/tree/main/videomae — MindSpore — ★ 0
Abstract
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). Inspired by the recent ImageMAE, we propose customized video tube masking with an extremely high masking ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging the extraction of more effective video representations during pre-training. We obtain three important findings on SSVP: (1) An extremely high masking ratio (i.e., 90% to 95%) still yields favorable VideoMAE performance. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT achieves 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data. Code is available at https://github.com/MCG-NJU/VideoMAE.
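The abstract's key design is tube masking: the same set of spatial patches is masked in every frame, so temporally redundant content cannot be trivially copied from a neighboring frame. Below is a minimal NumPy sketch of that idea (not the authors' implementation; the function name, shapes, and 90% default ratio are illustrative assumptions based on the abstract).

```python
import numpy as np

def tube_mask(num_frames, num_patches, mask_ratio=0.9, seed=None):
    """Sketch of VideoMAE-style tube masking (illustrative, not the official code).

    Samples one spatial mask and repeats it across all frames, so each masked
    patch forms a "tube" through time. Returns a boolean array of shape
    (num_frames, num_patches) where True means the patch is masked out.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(round(mask_ratio * num_patches))
    # Pick which spatial patches to hide — once, shared by every frame.
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    frame_mask = np.zeros(num_patches, dtype=bool)
    frame_mask[masked_idx] = True
    # Tile the single-frame mask along the temporal axis to form tubes.
    return np.tile(frame_mask, (num_frames, 1))

# Example: 8 temporal slots, 14x14 = 196 spatial patches, 90% masked.
mask = tube_mask(num_frames=8, num_patches=196, mask_ratio=0.9, seed=0)
print(mask.shape)  # (8, 196); every row is identical, ~90% True
```

Because the visible patches are the same in all frames, the encoder only ever sees roughly 10% of the tokens, which is what makes pre-training both harder (good for representation quality) and cheaper (fewer tokens to process).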
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| AVA v2.2 | VideoMAE (K400 pretrain, ViT-B, 16x4) | mAP | 26.7 | — | Unverified |
| AVA v2.2 | VideoMAE (K700 pretrain, ViT-L, 16x4) | mAP | 36.1 | — | Unverified |
| AVA v2.2 | VideoMAE (K400 pretrain, ViT-L, 16x4) | mAP | 34.3 | — | Unverified |
| AVA v2.2 | VideoMAE (K400 pretrain+finetune, ViT-B, 16x4) | mAP | 31.8 | — | Unverified |
| AVA v2.2 | VideoMAE (K400 pretrain+finetune, ViT-H, 16x4) | mAP | 39.5 | — | Unverified |
| AVA v2.2 | VideoMAE (K700 pretrain+finetune, ViT-L, 16x4) | mAP | 39.3 | — | Unverified |
| AVA v2.2 | VideoMAE (K400 pretrain+finetune, ViT-L, 16x4) | mAP | 37.8 | — | Unverified |
| AVA v2.2 | VideoMAE (K400 pretrain, ViT-H, 16x4) | mAP | 36.5 | — | Unverified |
| Something-Something V2 | VideoMAE (no extra data, ViT-L, 16frame) | Top-1 Accuracy | 74.3 | — | Unverified |
| Something-Something V2 | VideoMAE (no extra data, ViT-B, 16frame) | Top-1 Accuracy | 70.8 | — | Unverified |
| Something-Something V2 | VideoMAE (no extra data, ViT-L, 32x2) | Top-1 Accuracy | 75.4 | — | Unverified |