Training data-efficient image transformers & distillation through attention
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou
Code
- github.com/facebookresearch/deit (official, in paper, pytorch, ★ 4,327)
- github.com/huggingface/transformers (pytorch, ★ 158,292)
- github.com/rwightman/pytorch-image-models (in paper, pytorch, ★ 36,538)
- github.com/PaddlePaddle/PaddleClas (paddle, ★ 5,788)
- github.com/hustvl/vim (pytorch, ★ 3,823)
- github.com/alibaba/EasyCV (pytorch, ★ 1,949)
- github.com/open-edge-platform/training_extensions (pytorch, ★ 1,220)
- github.com/jacobgil/vit-explain (pytorch, ★ 1,074)
- github.com/open-edge-platform/geti (pytorch, ★ 467)
- github.com/TACJu/TransFG (pytorch, ★ 421)
Abstract
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on ImageNet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both ImageNet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
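The abstract does not spell out the distillation objective, so the following is only a minimal sketch of how a token-based teacher-student loss could look. It assumes the student produces two sets of logits, one from a class token and one from a distillation token, and that the distillation head is trained against the teacher's hard prediction with an equal 0.5/0.5 weighting. The function names, the weighting, and the hard-label choice are illustrative assumptions, not details taken from the text above. Plain Python is used so the sketch runs without a deep learning framework.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cross_entropy(logits, target):
    """Cross-entropy of one sample against an integer class index."""
    return -log_softmax(logits)[target]


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Hypothetical helper: average of two cross-entropy terms.

    cls_logits     -- student logits from the class token
    dist_logits    -- student logits from the distillation token
    teacher_logits -- logits from the (frozen) teacher, e.g. a convnet
    label          -- ground-truth class index
    The 0.5/0.5 weighting is an assumption made for this sketch.
    """
    # The teacher's hard decision serves as the target for the distillation head.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_cls = cross_entropy(cls_logits, label)
    loss_dist = cross_entropy(dist_logits, teacher_label)
    return 0.5 * loss_cls + 0.5 * loss_dist


if __name__ == "__main__":
    # Made-up logits for a 3-class problem, for illustration only.
    loss = hard_distillation_loss(
        cls_logits=[2.0, 0.5, -1.0],
        dist_logits=[0.1, 3.0, 0.2],
        teacher_logits=[1.0, 5.0, 2.0],
        label=0,
    )
    print(f"combined loss: {loss:.4f}")
```

In this sketch the two heads can receive different targets for the same image: the class head follows the ground-truth label while the distillation head follows the teacher, which is one way a student could absorb a convnet teacher's behaviour.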
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| RVL-CDIP | DeiT-B | Accuracy | 90.32 | — | Unverified |