MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

2021-12-02 · CVPR 2022 · Code Available

Yanghao Li, Chao-yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

Code

github.com/facebookresearch/mvit (official, PyTorch, ★ 453)
github.com/facebookresearch/detectron2/tree/main/projects/MViTv2 (official, PyTorch, ★ 0)
github.com/JunweiLiang/aicity_action (PyTorch, ★ 28)
github.com/rajatmodi62/occludedactionbenchmark (PyTorch, ★ 9)

Abstract

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as well as 86.1% on Kinetics-400 video classification. Code and models are available at https://github.com/facebookresearch/mvit.
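The abstract names residual pooling connections without spelling them out. As a rough illustration only, the sketch below shows the general idea on a 1D token sequence: queries, keys and values are pooled, attention is computed on the pooled tensors, and the pooled query is added back to the attention output. It is not the official implementation. Average pooling, identity projections, a single head, and the names `avg_pool` and `pooling_attention` are all assumptions made for this example; the repositories listed above contain the actual code.

```python
import math


def avg_pool(seq, stride):
    """Average-pool a list of vectors over non-overlapping windows.

    Assumption: plain average pooling stands in for the learned pooling
    operator used in the real model.
    """
    pooled = []
    for start in range(0, len(seq), stride):
        window = seq[start:start + stride]
        dim = len(window[0])
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pooling_attention(x, q_stride=1, kv_stride=2, residual=True):
    """Single-head pooling attention with an optional residual pooling connection.

    Assumption: identity projections replace the learned linear layers,
    so q, k and v are just pooled copies of the input x.
    """
    q = avg_pool(x, q_stride)
    k = avg_pool(x, kv_stride)
    v = avg_pool(x, kv_stride)
    scale = 1.0 / math.sqrt(len(x[0]))

    out = []
    for qi in q:
        # Scaled dot-product scores of this pooled query against all pooled keys.
        weights = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        zi = [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(qi))]
        if residual:
            # Residual pooling connection: add the pooled query to the output.
            zi = [z + qd for z, qd in zip(zi, qi)]
        out.append(zi)
    return out


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    print(pooling_attention(tokens, q_stride=2, kv_stride=2))
```

With `q_stride=2` the output sequence is half as long as the input, and the residual term lets information from the pooled query bypass the attention weights.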

Tasks

Action Classification, Action Recognition, Image Classification, Instance Segmentation, Object, Object Detection, Video Classification, Video Recognition

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
AVA v2.2	MViTv2-L (IN21k, K700)	mAP	34.4	—	Unverified
Something-Something V2	MViTv2-L (IN-21K + Kinetics400 pretrain)	Top-1 Accuracy	73.3	—	Unverified
Something-Something V2	MViTv2-B (IN-21K + Kinetics400 pretrain)	Top-5 Accuracy	93.4	—	Unverified
Something-Something V2	MViT-B (IN-21K + Kinetics400 pretrain)	Top-1 Accuracy	72.1	—	Unverified
