Training data-efficient image transformers & distillation through attention
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou
Code
- github.com/hustvl/vim (pytorch, ★ 3,823)
- github.com/jacobgil/vit-explain (pytorch, ★ 1,074)
- github.com/TACJu/TransFG (pytorch, ★ 421)
- github.com/omihub777/vit-cifar (pytorch, ★ 206)
- github.com/moein-shariatnia/Pix2Seq (pytorch, ★ 130)
- github.com/gatech-eic/vitcod (pytorch, ★ 130)
- github.com/cogtoolslab/physics-benchmarking-neurips2021 (none, ★ 87)
- github.com/UdbhavPrasad072300/Transformer-Implementations (pytorch, ★ 69)
- github.com/zhuhanqing/lightening-transformer (pytorch, ★ 41)
- github.com/aiot-mlsys-lab/famba-v (pytorch, ★ 34)
Abstract
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on ImageNet only. We train it on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves a top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the benefit of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets both on ImageNet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
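To make the teacher-student idea in the abstract concrete, here is a minimal sketch of a distillation-token objective for a single example. It assumes a hard-label variant: the class-token head is trained against the true label, the distillation-token head is trained against the teacher's top prediction, and the two terms are averaged with equal weight. The abstract does not spell out the loss, so the function names, the equal weighting, and the plain-list logits are illustrative assumptions rather than the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def cross_entropy(logits, target):
    """Cross-entropy of one example against an integer class index."""
    return -log_softmax(logits)[target]


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Hypothetical helper: average of two cross-entropy terms.

    cls_logits     -- student output read from the class token
    dist_logits    -- student output read from the distillation token
    teacher_logits -- output of the (frozen) teacher, e.g. a convnet
    label          -- ground-truth class index
    """
    # Assumption: the teacher's argmax is used as a hard pseudo-label.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_cls = cross_entropy(cls_logits, label)
    loss_dist = cross_entropy(dist_logits, teacher_label)
    return 0.5 * loss_cls + 0.5 * loss_dist
```

For example, `hard_distillation_loss([2.0, 0.1, 0.0], [0.0, 3.0, 0.0], [1.0, 5.0, 2.0], label=0)` rewards the class-token head for predicting class 0 and the distillation-token head for agreeing with the teacher's choice of class 1, so the two heads can learn complementary signals.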
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| RVL-CDIP | DeiT-B | Accuracy | 90.32 | — | Unverified |