MaxViT: Multi-Axis Vision Transformer

2022-04-04

Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li

Code

Repository	Framework	Stars	Official
github.com/google-research/maxvit	tf	491	Yes (in paper)
github.com/google-research/maxim	jax	1,083	No
github.com/ChristophReich1996/MaxViT	pytorch	164	No
github.com/qwopqwop200/MaxVIT-pytorch	pytorch	9	No
github.com/RooKichenn/pytorch-MaxViT	pytorch	8	No
github.com/hankyul2/maxvit-pytorch	pytorch	6	No
github.com/leondgarse/keras_cv_attention_models/tree/main/keras_cv_attention_models/maxvit	tf	0	No
github.com/Mind23-2/MindCode-3/tree/main/NFNet	mindspore	0	No
github.com/huggingface/pytorch-image-models/blob/main/timm/models/maxxvit.py	pytorch	0	No
github.com/2024-MindSpore-1/Code3/tree/main/MaxViT	mindspore	0	No

Abstract

Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.
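
To make the two halves of multi-axis attention concrete, here is a minimal pure-Python sketch of how token positions might be grouped for blocked local attention and for dilated global (grid) attention. It is an illustration only, not the official implementation: the function names `block_partition`, `grid_partition`, and `attention_pairs` are hypothetical, the fixed window and grid size of 4 is an assumption chosen for the example, and a real model would do the grouping with tensor reshapes rather than index lists.

```python
def block_partition(h, w, p):
    """Blocked local attention: group an h x w grid of token positions
    into non-overlapping p x p windows (hypothetical helper)."""
    assert h % p == 0 and w % p == 0
    groups = {}
    for i in range(h):
        for j in range(w):
            # Tokens in the same p x p window share a key.
            groups.setdefault((i // p, j // p), []).append((i, j))
    return list(groups.values())


def grid_partition(h, w, g):
    """Dilated global attention: group positions into sets of g x g tokens
    sampled with stride (h // g, w // g), so each set spans the whole image."""
    assert h % g == 0 and w % g == 0
    stride_h, stride_w = h // g, w // g
    groups = {}
    for i in range(h):
        for j in range(w):
            # Tokens with the same offset inside each stride cell share a key.
            groups.setdefault((i % stride_h, j % stride_w), []).append((i, j))
    return list(groups.values())


def attention_pairs(groups):
    """Count query-key pairs when attention runs only within each group."""
    return sum(len(group) ** 2 for group in groups)


# Assumed example sizes: fixed window/grid size 4, two input resolutions.
for side in (8, 16):
    n_tokens = side * side
    local = attention_pairs(block_partition(side, side, 4))
    dilated = attention_pairs(grid_partition(side, side, 4))
    print(side, n_tokens, local, dilated, n_tokens ** 2)
```

With the group size held fixed, the pair count for both partitions grows in proportion to the number of tokens, whereas full self-attention grows with its square. Each grid group also contains tokens from across the whole image, which is one way to read the abstract's claim that the model can "see" globally even at high resolution.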

Tasks

Image Classification, Object Detection

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
ImageNet	MaxViT-T (224res)	Top 1 Accuracy	83.62	—	Unverified
ImageNet	MaxViT-L (224res)	Top 1 Accuracy	85.17	—	Unverified
ImageNet	MaxViT-B (224res)	Top 1 Accuracy	84.94	—	Unverified
ImageNet	MaxViT-S (224res)	Top 1 Accuracy	84.45	—	Unverified
ImageNet	MaxViT-XL (512res, JFT)	Top 1 Accuracy	89.53	—	Unverified
ImageNet	MaxViT-L (512res, JFT)	Top 1 Accuracy	89.41	—	Unverified
ImageNet	MaxViT-XL (384res, JFT)	Top 1 Accuracy	89.36	—	Unverified
ImageNet	MaxViT-L (384res, JFT)	Top 1 Accuracy	89.12	—	Unverified
ImageNet	MaxViT-B (512res, JFT)	Top 1 Accuracy	88.82	—	Unverified
ImageNet	MaxViT-XL (512res, 21K)	Top 1 Accuracy	88.7	—	Unverified
ImageNet	MaxViT-B (384res, JFT)	Top 1 Accuracy	88.69	—	Unverified
ImageNet	MaxViT-XL (384res, 21K)	Top 1 Accuracy	88.51	—	Unverified
ImageNet	MaxViT-L (512res, 21K)	Top 1 Accuracy	88.46	—	Unverified
ImageNet	MaxViT-B (512res, 21K)	Top 1 Accuracy	88.38	—	Unverified
ImageNet	MaxViT-L (384res, 21K)	Top 1 Accuracy	88.32	—	Unverified
ImageNet	MaxViT-B (512res)	Top 1 Accuracy	86.7	—	Unverified
ImageNet	MaxViT-L (384res)	Top 1 Accuracy	86.4	—	Unverified
ImageNet	MaxViT-B (384res)	Top 1 Accuracy	86.34	—	Unverified
ImageNet	MaxViT-S (512res)	Top 1 Accuracy	86.19	—	Unverified
ImageNet	MaxViT-T (384res)	Top 1 Accuracy	85.72	—	Unverified
