Masked Feature Prediction for Self-Supervised Visual Pre-Training

2021-12-16 · CVPR 2022 · Code Available

Chen Wei, Haoqi Fan, Saining Xie, Chao-yuan Wu, Alan Yuille, Christoph Feichtenhofer

Code

github.com/mx-mark/videotransformer-pytorch (PyTorch, ★ 306)
github.com/mx-mark/dmjd (PyTorch, ★ 11)
github.com/yyk-wew/semanticmim (PyTorch, ★ 8)

Abstract

We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 39.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame and obtains competitive results on ImageNet.
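The mask-then-predict objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 16-pixel patch size, the 9 orientation bins, and the 0.4 mask ratio are all assumptions chosen for illustration, and a per-patch L2 normalization stands in for the local contrast normalization used in HOG. The all-zero `predicted` array is a placeholder for a model's output.

```python
import numpy as np

def hog_per_patch(image, patch=16, bins=9, eps=1e-6):
    """Simplified HOG: one L2-normalized orientation histogram per patch."""
    gy, gx = np.gradient(image.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    # Unsigned gradient orientation, folded into [0, pi).
    orientation = np.mod(np.arctan2(gy, gx), np.pi)
    bin_index = np.minimum((orientation / np.pi * bins).astype(int), bins - 1)

    rows, cols = image.shape[0] // patch, image.shape[1] // patch
    features = np.zeros((rows * cols, bins))
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * patch, (r + 1) * patch),
                      slice(c * patch, (c + 1) * patch))
            # Magnitude-weighted vote of each pixel into its orientation bin.
            features[r * cols + c] = np.bincount(
                bin_index[window].ravel(),
                weights=magnitude[window].ravel(),
                minlength=bins,
            )
    # Per-patch L2 normalization (assumed stand-in for local contrast normalization).
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / (norms + eps)

def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask that is True for the randomly chosen masked patches."""
    mask = np.zeros(num_patches, dtype=bool)
    num_masked = int(round(mask_ratio * num_patches))
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def masked_feature_loss(predicted, target, mask):
    """Mean squared error computed over masked patches only."""
    return float(np.mean((predicted[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
image = rng.random((64, 64))              # stand-in for a grayscale frame
target = hog_per_patch(image)             # one HOG vector per 16x16 patch
mask = random_patch_mask(len(target), mask_ratio=0.4, rng=rng)  # ratio is an assumption
predicted = np.zeros_like(target)         # placeholder for model predictions
loss = masked_feature_loss(predicted, target, mask)
```

Only the masked patches contribute to the loss, so the model is never rewarded for copying features of regions it can see. In an actual training setup the masked input would be fed to a Transformer and `predicted` would come from its output head.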

Tasks

Action Classification, Action Recognition, Prediction, Self-Supervised Image Classification

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
AVA v2.2	MaskFeat (Kinetics-600 pretrain, MViT-L)	mAP	39.8	—	Unverified
Something-Something V2	MaskFeat (Kinetics-600 pretrain, MViT-L)	Top-1 Accuracy	75.0	—	Unverified