Exploring Plain Vision Transformer Backbones for Object Detection
Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He
Code Implementations
- github.com/facebookresearch/detectron2/tree/main/projects/ViTDet (Official, PyTorch, ★ 0)
- github.com/alibaba/EasyCV (PyTorch, ★ 1,949)
- github.com/ViTAE-Transformer/ViTDet (PyTorch, ★ 579)
- github.com/vitae-transformer/qformer (PyTorch, ★ 235)
- github.com/kdexd/coco-rem (PyTorch, ★ 32)
- github.com/hula-ai/DAMA (PyTorch, ★ 17)
- gitlab.com/birder/birder (PyTorch, ★ 0)
- github.com/pwc-1/Paper-9/tree/main/1/vitdet (MindSpore, ★ 0)
- github.com/MindCode-4/code-1/tree/main/vit (MindSpore, ★ 0)
- github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/vitdet (Paddle, ★ 0)
Abstract
We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided by very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 box AP on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
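The "simple feature pyramid" in point (i) can be sketched in PyTorch roughly as follows. This is a minimal illustration of the idea, building multi-scale maps (strides 4, 8, 16, 32) from the single stride-16 ViT output with parallel deconvolutions and pooling, with no top-down FPN pathway. The module name, channel widths, and exact layer choices here are illustrative assumptions; the official implementation lives in the Detectron2 project linked above.

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """Illustrative sketch: derive a multi-scale pyramid from the
    single stride-16 feature map of a plain ViT backbone."""

    def __init__(self, dim: int = 768, out_dim: int = 256):
        super().__init__()
        # stride 16 -> 4: two 2x deconvolutions
        self.scale4 = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, out_dim, kernel_size=2, stride=2),
        )
        # stride 16 -> 8: one 2x deconvolution
        self.scale8 = nn.ConvTranspose2d(dim, out_dim, kernel_size=2, stride=2)
        # stride 16 kept as-is (channel projection only)
        self.scale16 = nn.Conv2d(dim, out_dim, kernel_size=1)
        # stride 16 -> 32: 2x downsampling by pooling
        self.scale32 = nn.Sequential(
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(dim, out_dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        # x: (B, dim, H/16, W/16) -- the last ViT feature map
        return [self.scale4(x), self.scale8(x), self.scale16(x), self.scale32(x)]
```

Each branch operates independently on the same single-scale map, which is what distinguishes this design from an FPN, where features flow top-down across levels.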
Tasks
- Object Detection
- Instance Segmentation
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| COCO minival | ViTDet, ViT-H Cascade (multiscale) | mask AP | 53.1 | — | Unverified |
| COCO minival | ViTDet, ViT-H Cascade | mask AP | 52.0 | — | Unverified |
| LVIS v1.0 val | ViTDet-L | mask AP | 46.0 | — | Unverified |
| LVIS v1.0 val | ViTDet-H | mask AP | 48.1 | — | Unverified |