Multiscale Vision Transformers

2021-04-22 · ICCV 2021 · Code Available

Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer

Code

github.com/facebookresearch/pytorchvideo (PyTorch, ★ 3,551)
github.com/facebookresearch/mvit (PyTorch, ★ 453)
github.com/wangjk666/stts (PyTorch, ★ 52)
github.com/junweiliang/multitrain (PyTorch, ★ 20)
github.com/rohanshad/cmr_transformer (PyTorch, ★ 6)

Abstract

We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast
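
The channel-resolution trade-off described in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the function name `build_pyramid`, the starting width of 96 channels, the 56x56 starting grid, the four stages, and the factor of 2 applied at each stage are all assumptions chosen for illustration. The sketch only tracks tensor shapes to show how the token count shrinks while the channel dimension grows.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """Shape bookkeeping for one scale stage (hypothetical helper)."""
    index: int
    channels: int
    height: int
    width: int

    @property
    def tokens(self) -> int:
        # Number of spatial positions the attention layers operate on.
        return self.height * self.width


def build_pyramid(
    base_channels: int = 96,   # assumed starting channel dimension
    height: int = 56,          # assumed starting spatial grid
    width: int = 56,
    num_stages: int = 4,       # assumed number of scale stages
    channel_mult: int = 2,     # assumed channel expansion per stage
    spatial_stride: int = 2,   # assumed spatial reduction per stage
) -> List[Stage]:
    """Return the per-stage shapes of a multiscale feature pyramid."""
    stages = []
    c, h, w = base_channels, height, width
    for i in range(num_stages):
        stages.append(Stage(index=i + 1, channels=c, height=h, width=w))
        # Moving to the next stage: widen the channels, coarsen the grid.
        c *= channel_mult
        h //= spatial_stride
        w //= spatial_stride
    return stages


if __name__ == "__main__":
    for s in build_pyramid():
        print(f"stage {s.index}: {s.channels} channels, "
              f"{s.height}x{s.width} grid, {s.tokens} tokens")
```

Under these assumed settings, early stages hold many tokens with few channels and later stages hold few tokens with many channels, which is the pyramid shape the abstract describes.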

Tasks

Action Classification, Action Recognition, Image Classification, Video Recognition

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
AVA v2.2	MViT-B-24, 32x3 (Kinetics-600 pretraining)	mAP	28.7	—	Unverified
AVA v2.2	MViT-B, 32x3 (Kinetics-600 pretraining)	mAP	27.5	—	Unverified
AVA v2.2	MViT-B, 64x3 (Kinetics-400 pretraining)	mAP	27.3	—	Unverified
AVA v2.2	MViT-B, 32x3 (Kinetics-400 pretraining)	mAP	26.8	—	Unverified
AVA v2.2	MViT-B, 16x4 (Kinetics-600 pretraining)	mAP	26.1	—	Unverified
AVA v2.2	MViT-B, 16x4 (Kinetics-400 pretraining)	mAP	24.5	—	Unverified
Something-Something V2	MViT-B, 16x4	Top-1 Accuracy	66.2	—	Unverified
Something-Something V2	MViT-B-24, 32x3	Top-1 Accuracy	68.7	—	Unverified
Something-Something V2	MViT-B, 32x3 (Kinetics-600 pretraining)	Top-1 Accuracy	67.8	—	Unverified
