Incorporating Convolution Designs into Visual Transformers

2021-03-22 · ICCV 2021 · Code Available

Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, Wei Wu

Abstract

Motivated by the success of Transformers in natural language processing (NLP) tasks, several attempts (e.g., ViT and DeiT) have emerged to apply Transformers to the vision domain. However, pure Transformer architectures often require a large amount of training data or extra supervision to obtain performance comparable to that of convolutional neural networks (CNNs). To overcome these limitations, we analyze the potential drawbacks of directly borrowing Transformer architectures from NLP. We then propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies. Three modifications are made to the original Transformer: 1) instead of straightforward tokenization from raw input images, we design an Image-to-Tokens (I2T) module that extracts patches from generated low-level features; 2) the feed-forward network in each encoder block is replaced with a Locally-enhanced Feed-Forward (LeFF) layer that promotes correlation among neighboring tokens in the spatial dimension; 3) a Layer-wise Class token Attention (LCA) is attached at the top of the Transformer to utilize the multi-level representations. Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data or extra CNN teachers. In addition, CeiT models demonstrate better convergence with 3× fewer training iterations, which can reduce the training cost significantly. Code and models will be released upon acceptance.
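The idea behind the LeFF layer, letting each token interact with its spatial neighbors, can be illustrated with a small pure-Python sketch. This is not the authors' implementation. It assumes the patch tokens form a square grid in row-major order, uses a fixed 3x3 averaging kernel as a placeholder for learned depth-wise convolution weights, and omits the linear projections, normalization, and activation that a full layer would contain. The function name `local_token_mixing` is hypothetical.

```python
import math


def local_token_mixing(tokens, kernel=None):
    """Mix each patch token with its spatial neighbours, channel by channel.

    tokens: list of N+1 vectors; tokens[0] is the class token and the
            remaining N = side*side patch tokens are in row-major order.
    kernel: 3x3 weights shared by all channels (an assumption made here;
            a learned depth-wise convolution would hold one kernel per channel).
    """
    if kernel is None:
        kernel = [[1.0 / 9.0] * 3 for _ in range(3)]  # placeholder weights
    cls_token, patches = tokens[0], tokens[1:]
    side = math.isqrt(len(patches))
    if side * side != len(patches):
        raise ValueError("patch tokens must form a square grid")
    dim = len(cls_token)

    # Restore the flat token sequence to a side x side grid of vectors.
    grid = [patches[r * side:(r + 1) * side] for r in range(side)]

    mixed = []
    for r in range(side):
        for c in range(side):
            out = [0.0] * dim
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < side and 0 <= cc < side:  # zero padding
                        w = kernel[dr + 1][dc + 1]
                        for ch in range(dim):
                            out[ch] += w * grid[rr][cc][ch]
            mixed.append(out)

    # The class token has no spatial position, so it passes through unchanged.
    return [list(cls_token)] + mixed


# Toy input: a class token followed by a 3x3 grid of one-channel patch tokens.
toy_tokens = [[9.0]] + [[float(i)] for i in range(9)]
toy_output = local_token_mixing(toy_tokens)
```

In the toy example the center patch token becomes the average of all nine grid values, while the class token is returned unchanged; this is the sense in which neighboring tokens become correlated along the spatial dimension.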

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| CIFAR-10 | CeiT-T | Percentage correct | 98.5 | | Unverified |
| CIFAR-10 | CeiT-S | Percentage correct | 99 | | Unverified |
| CIFAR-10 | CeiT-S (384 finetune resolution) | Percentage correct | 99.1 | | Unverified |
| CIFAR-100 | CeiT-T (384 finetune resolution) | Percentage correct | 88 | | Unverified |
| CIFAR-100 | CeiT-S (384 finetune resolution) | Percentage correct | 91.8 | | Unverified |
| CIFAR-100 | CeiT-T | Percentage correct | 89.4 | | Unverified |
| CIFAR-100 | CeiT-S | Percentage correct | 91.8 | | Unverified |
| Flowers-102 | CeiT-S (384 finetune resolution) | Accuracy | 98.6 | | Unverified |
| Flowers-102 | CeiT-S | Accuracy | 98.2 | | Unverified |
| Flowers-102 | CeiT-T (384 finetune resolution) | Accuracy | 97.8 | | Unverified |
| Flowers-102 | CeiT-T | Accuracy | 96.9 | | Unverified |
| ImageNet | CeiT-T | Top 1 Accuracy | 76.4 | | Unverified |
| ImageNet | CeiT-S | Top 1 Accuracy | 82 | | Unverified |
| ImageNet | CeiT-S (384 finetune res) | Top 1 Accuracy | 83.3 | | Unverified |
| ImageNet | CeiT-T (384 finetune res) | Top 1 Accuracy | 78.8 | | Unverified |
| ImageNet ReaL | CeiT-T | Accuracy | 83.6 | | Unverified |
| ImageNet ReaL | CeiT-S (384 finetune res) | Accuracy | 88.1 | | Unverified |
| ImageNet ReaL | CeiT-S | Accuracy | 87.3 | | Unverified |
| iNaturalist 2018 | CeiT-T | Top-1 Accuracy | 64.3 | | Unverified |
| iNaturalist 2018 | CeiT-T (384 finetune resolution) | Top-1 Accuracy | 72.2 | | Unverified |
| iNaturalist 2018 | CeiT-S | Top-1 Accuracy | 73.3 | | Unverified |
| iNaturalist 2018 | CeiT-S (384 finetune resolution) | Top-1 Accuracy | 79.4 | | Unverified |
| iNaturalist 2019 | CeiT-S (384 finetune resolution) | Top-1 Accuracy | 82.7 | | Unverified |
| iNaturalist 2019 | CeiT-S | Top-1 Accuracy | 78.9 | | Unverified |
| iNaturalist 2019 | CeiT-T (384 finetune resolution) | Top-1 Accuracy | 77.9 | | Unverified |
| iNaturalist 2019 | CeiT-T | Top-1 Accuracy | 72.8 | | Unverified |
| Oxford-IIIT Pets | CeiT-T | Accuracy | 93.8 | | Unverified |
| Oxford-IIIT Pets | CeiT-S (384 finetune resolution) | Accuracy | 94.9 | | Unverified |
| Oxford-IIIT Pets | CeiT-S | Accuracy | 94.6 | | Unverified |
| Oxford-IIIT Pets | CeiT-T (384 finetune resolution) | Accuracy | 94.5 | | Unverified |
| Stanford Cars | CeiT-S (384 finetune resolution) | Accuracy | 94.1 | | Unverified |
| Stanford Cars | CeiT-T | Accuracy | 90.5 | | Unverified |
| Stanford Cars | CeiT-T (384 finetune resolution) | Accuracy | 93 | | Unverified |
| Stanford Cars | CeiT-S | Accuracy | 93.2 | | Unverified |
