LocalViT: Bringing Locality to Vision Transformers

2021-04-12 · Code Available

Yawei Li, Kai Zhang, JieZhang Cao, Radu Timofte, Luc van Gool

Abstract

We study how to introduce locality mechanisms into vision transformers. The transformer network originates from machine translation and is particularly good at modelling long-range dependencies within a long sequence. Although the global interaction between the token embeddings can be modelled well by the self-attention mechanism of transformers, what is lacking is a locality mechanism for information exchange within a local region. Yet locality is essential for images, since it pertains to structures like lines, edges, shapes, and even objects. We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network. This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks. The importance of locality mechanisms is validated in two ways: 1) a wide range of design choices (activation function, layer placement, expansion ratio) are available for incorporating locality mechanisms, and all proper choices can lead to a performance gain over the baseline, and 2) the same locality mechanism is successfully applied to 4 vision transformers, which shows the generalization of the locality concept. In particular, for ImageNet2012 classification, the locality-enhanced transformers outperform the baselines DeiT-T and PVT-T by 2.6% and 3.1% with a negligible increase in the number of parameters and computational effort. Code is available at https://github.com/ofsoundof/LocalViT.
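As a rough illustration of the idea in the abstract, the sketch below implements a depth-wise convolution in plain Python and compares its weight count with that of a standard feed-forward network. It is not the authors' implementation (see the repository linked above for that). The function names, the 3x3 kernel size, zero padding, and the example sizes (embedding dimension 192, expansion ratio 4) are assumptions chosen for illustration, and biases are ignored.

```python
def depthwise_conv2d(x, kernels):
    """Depth-wise convolution with stride 1 and zero padding.

    x:       nested lists shaped [C][H][W] (tokens reshaped to a 2D grid)
    kernels: nested lists shaped [C][k][k], one k x k filter per channel
    Each channel is filtered independently, so information is only
    exchanged between spatially neighbouring positions.
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    k = len(kernels[0])
    pad = k // 2
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for c in range(channels):
        for i in range(height):
            for j in range(width):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        # Positions outside the grid count as zero.
                        if 0 <= ii < height and 0 <= jj < width:
                            acc += x[c][ii][jj] * kernels[c][di][dj]
                out[c][i][j] = acc
    return out


def ffn_weight_count(dim, expansion):
    """Weights of two linear layers: dim -> expansion*dim -> dim (no biases)."""
    return 2 * dim * dim * expansion


def depthwise_weight_count(dim, expansion, kernel_size=3):
    """Weights of one depth-wise conv applied to the expanded hidden features."""
    return dim * expansion * kernel_size * kernel_size


# Assumed example sizes: embedding dimension 192, expansion ratio 4.
ffn = ffn_weight_count(192, 4)
dw = depthwise_weight_count(192, 4)
print(f"FFN weights: {ffn}, added depth-wise weights: {dw} ({100 * dw / ffn:.2f}%)")
```

Under these assumed sizes, the depth-wise layer adds only a few percent to the weights of the feed-forward network, which is in line with the abstract's claim of a negligible parameter increase.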

Benchmark Results

Dataset   Model         Metric          Claimed  Verified  Status
ImageNet  LocalViT-S    Top 1 Accuracy  80.8     -         Unverified
ImageNet  LocalViT-PVT  Top 1 Accuracy  78.2     -         Unverified
ImageNet  LocalViT-TNT  Top 1 Accuracy  75.9     -         Unverified
ImageNet  LocalViT-T    Top 1 Accuracy  74.8     -         Unverified
ImageNet  LocalViT-T2T  Top 1 Accuracy  72.5     -         Unverified

Reproductions