Training data-efficient image transformers & distillation through attention

2020-12-23 · Code available

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou

Code

Repository	Framework	Stars
github.com/hustvl/vim	pytorch	3,823
github.com/jacobgil/vit-explain	pytorch	1,074
github.com/TACJu/TransFG	pytorch	421
github.com/omihub777/vit-cifar	pytorch	206
github.com/moein-shariatnia/Pix2Seq	pytorch	130
github.com/gatech-eic/vitcod	pytorch	130
github.com/cogtoolslab/physics-benchmarking-neurips2021	none	87
github.com/UdbhavPrasad072300/Transformer-Implementations	pytorch	69
github.com/zhuhanqing/lightening-transformer	pytorch	41
github.com/aiot-mlsys-lab/famba-v	pytorch	34

Abstract

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on ImageNet only. We train it on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the benefit of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets both on ImageNet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
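
The abstract does not spell out the training objective, so the following is only a minimal sketch of how a token-based, teacher-student loss could look. It assumes a hard-label variant: the output read from the class token is compared with the true label, and the output read from the distillation token is compared with the teacher's predicted class. The function names, the equal 0.5 weighting, and the toy logits are illustrative assumptions, not the authors' released code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, target):
    """Cross-entropy of one example against an integer class index."""
    return -log_softmax(logits)[target]


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Assumed hard-label distillation objective (hypothetical helper).

    cls_logits:     student output read from the class token
    dist_logits:    student output read from the distillation token
    teacher_logits: teacher output (for example, from a convnet)
    label:          ground-truth class index
    """
    # The teacher's hard decision is the index of its largest logit.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    # Equal weighting of the two terms is an assumption of this sketch.
    return 0.5 * cross_entropy(cls_logits, label) + 0.5 * cross_entropy(
        dist_logits, teacher_label
    )


# Toy 3-class example with made-up logits.
loss = hard_distillation_loss(
    cls_logits=[2.0, 0.5, -1.0],
    dist_logits=[0.1, 1.5, 0.3],
    teacher_logits=[0.2, 3.0, -0.5],
    label=0,
)
```

In this sketch the two heads receive different targets, so the distillation token can follow the teacher even when the teacher disagrees with the ground-truth label.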

Tasks

Document Image Classification
Document Layout Analysis
Efficient ViTs
Fine-Grained Image Classification
Image Classification

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
RVL-CDIP	DeiT-B	Accuracy	90.32	—	Unverified
