Rethinking Spatial Dimensions of Vision Transformers

2021-03-30 · ICCV 2021 · Code Available

Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh


Code

github.com/naver-ai/pit
Official · In paper · pytorch · ★ 246
github.com/naver-ai/pflayer
pytorch · ★ 84
github.com/hatimwen/paddle_pit
paddle · ★ 5
github.com/conceptofmind/PiT-flax
jax · ★ 0
github.com/mindspore-courses/External-Attention-MindSpore/blob/main/model/backbone/PIT.py
mindspore · ★ 0
github.com/BR-IDL/PaddleViT/tree/develop/image_classification/PiT
paddle · ★ 0

Abstract

Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks, offering an alternative architecture to existing convolutional neural networks (CNNs). Because transformer-based architectures are still new to computer vision modeling, the design conventions for an effective architecture have been little studied. Drawing on the successful design principles of CNNs, we investigate the role of spatial dimension conversion and its effectiveness in transformer-based architectures. We focus in particular on the dimension reduction principle of CNNs: as depth increases, a conventional CNN increases the channel dimension and decreases the spatial dimensions. We empirically show that such spatial dimension reduction also benefits a transformer architecture, and we propose a novel Pooling-based Vision Transformer (PiT) built on the original ViT model. We show that PiT achieves improved model capability and generalization performance over ViT. Through extensive experiments, we further show that PiT outperforms the baseline on several tasks such as image classification, object detection, and robustness evaluation. Source code and ImageNet models are available at https://github.com/naver-ai/pit
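
The dimension reduction principle described in the abstract (fewer spatial positions and more channels as depth grows) can be illustrated with a small shape calculation. The sketch below is not taken from the PiT code: the `Stage` class, the `pooled_schedule` helper, the stride of 2, the channel multiplier of 2, and the 28x28 starting grid with 64 channels are all hypothetical values chosen only to show how the token count and embedding width might evolve across stages.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    side: int      # tokens per spatial side (the grid is side x side)
    channels: int  # embedding width per token

    @property
    def tokens(self) -> int:
        return self.side * self.side


def pooled_schedule(side, channels, num_stages, stride=2, widen=2):
    """Hypothetical schedule: between stages, shrink the grid and widen channels.

    The stride and widen factors are assumptions for illustration,
    not the configuration used by the paper.
    """
    stages = [Stage(side, channels)]
    for _ in range(num_stages - 1):
        prev = stages[-1]
        # Ceiling division so an odd grid side is still fully covered.
        new_side = -(-prev.side // stride)
        stages.append(Stage(new_side, prev.channels * widen))
    return stages


if __name__ == "__main__":
    for i, s in enumerate(pooled_schedule(side=28, channels=64, num_stages=3)):
        print(f"stage {i}: {s.side}x{s.side} grid, "
              f"{s.tokens} tokens, {s.channels} channels")
```

Under these assumed settings the token count falls by a factor of four at each pooling step while the channel width doubles, which mirrors the CNN-style trade-off the abstract refers to. A ViT-style baseline would correspond to `num_stages=1`, where the grid and width stay fixed throughout.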

Tasks

Dimensionality Reduction, Image Classification, Object Detection

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
ImageNet	PiT-Ti	Top 1 Accuracy	74.6	—	Unverified
ImageNet	PiT-B	Top 1 Accuracy	84	—	Unverified
ImageNet	PiT-S	Top 1 Accuracy	81.9	—	Unverified
ImageNet	PiT-XS	Top 1 Accuracy	79.1	—	Unverified
