Exploring Plain Vision Transformer Backbones for Object Detection
Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He
Code Implementations
- github.com/facebookresearch/detectron2/tree/main/projects/ViTDet (Official, PyTorch, ★ 0)
- github.com/alibaba/EasyCV (PyTorch, ★ 1,949)
- github.com/ViTAE-Transformer/ViTDet (PyTorch, ★ 579)
- github.com/vitae-transformer/qformer (PyTorch, ★ 235)
- github.com/kdexd/coco-rem (PyTorch, ★ 32)
- github.com/hula-ai/DAMA (PyTorch, ★ 17)
- gitlab.com/birder/birder (PyTorch, ★ 0)
- github.com/pwc-1/Paper-9/tree/main/1/vitdet (MindSpore, ★ 0)
- github.com/MindCode-4/code-1/tree/main/vit (MindSpore, ★ 0)
- github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/vitdet (Paddle, ★ 0)
Abstract
We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided by very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 box AP on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
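The "simple feature pyramid" in point (i) can be sketched in PyTorch roughly as follows. This is a minimal illustration of the idea, building multi-scale maps (strides 4, 8, 16, 32) from the single stride-16 ViT output with parallel deconvolutions and pooling, with no top-down FPN pathway. The module name, channel widths, and exact layer choices here are illustrative assumptions; the official implementation lives in the Detectron2 project linked above.

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """Illustrative sketch: derive a multi-scale pyramid from the
    single stride-16 feature map of a plain ViT backbone."""

    def __init__(self, dim: int = 768, out_dim: int = 256):
        super().__init__()
        # stride 16 -> 4: two 2x deconvolutions
        self.scale4 = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, out_dim, kernel_size=2, stride=2),
        )
        # stride 16 -> 8: one 2x deconvolution
        self.scale8 = nn.ConvTranspose2d(dim, out_dim, kernel_size=2, stride=2)
        # stride 16 kept as-is (channel projection only)
        self.scale16 = nn.Conv2d(dim, out_dim, kernel_size=1)
        # stride 16 -> 32: 2x downsampling by pooling
        self.scale32 = nn.Sequential(
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(dim, out_dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        # x: (B, dim, H/16, W/16) -- the last ViT feature map
        return [self.scale4(x), self.scale8(x), self.scale16(x), self.scale32(x)]
```

Each branch operates independently on the same single-scale map, which is what distinguishes this design from an FPN, where features flow top-down across levels.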
Tasks
- Object Detection
- Instance Segmentation
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| COCO minival | ViTDet, ViT-H Cascade (multiscale) | mask AP | 53.1 | — | Unverified |
| COCO minival | ViTDet, ViT-H Cascade | mask AP | 52.0 | — | Unverified |
| LVIS v1.0 val | ViTDet-L | mask AP | 46.0 | — | Unverified |
| LVIS v1.0 val | ViTDet-H | mask AP | 48.1 | — | Unverified |