Rethinking Local Perception in Lightweight Vision Transformer
Qihang Fan, Huaibo Huang, Jiyang Guan, Ran He
Code: https://github.com/qhfan/CloFormer (official PyTorch implementation)
Abstract
Vision Transformers (ViTs) have been shown to be effective in various vision tasks. However, scaling them down to a mobile-friendly size leads to significant performance degradation, so developing lightweight vision transformers has become a crucial area of research. This paper introduces CloFormer, a lightweight vision transformer that leverages context-aware local enhancement. CloFormer explores the relationship between the globally shared weights used in vanilla convolution and the token-specific, context-aware weights that appear in attention, and then proposes a simple yet effective module to capture high-frequency local information. Specifically, CloFormer introduces AttnConv, a convolution operator in the style of attention. AttnConv uses shared weights to aggregate local information and deploys carefully designed context-aware weights to enhance local features. Combining AttnConv with vanilla attention, which uses pooling to reduce FLOPs, enables CloFormer to perceive both high-frequency and low-frequency information. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate the superiority of CloFormer. The code is available at https://github.com/qhfan/CloFormer.
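To make the AttnConv idea concrete, the following is a minimal PyTorch sketch of an attention-style convolution: a shared-weight depthwise convolution aggregates local information, and token-specific weights derived from a Q·K interaction reweight the result. The kernel size, the tanh gating, and the exact branch layout are assumptions for illustration, not the official CloFormer implementation.

```python
import torch
import torch.nn as nn

class AttnConvSketch(nn.Module):
    """Sketch of an attention-style convolution (hypothetical, not the
    official AttnConv): shared-weight local aggregation followed by
    token-specific, context-aware reweighting."""
    def __init__(self, dim, kernel_size=5):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        # Globally shared weights: a depthwise conv aggregates local info in V.
        self.local_agg = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        # Context-aware weights: computed per token from the Q*K interaction.
        self.ctx = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size,
                      padding=kernel_size // 2, groups=dim),
            nn.Tanh(),
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):            # x: (B, C, H, W)
        q, k, v = self.q(x), self.k(x), self.v(x)
        v = self.local_agg(v)        # shared-weight local aggregation
        w = self.ctx(q * k)          # token-specific, context-aware weights
        return self.proj(w * v)      # context-aware local enhancement
```

For example, `AttnConvSketch(64)(torch.randn(1, 64, 56, 56))` returns a tensor of the same shape, so the module can drop into a stage of a hierarchical backbone alongside a pooled global-attention branch.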
Tasks
- Image Classification
- Object Detection
- Semantic Segmentation
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| ImageNet | CloFormer-S | Top-1 Accuracy (%) | 81.6 | — | Unverified |
| ImageNet | CloFormer-XS | Top-1 Accuracy (%) | 79.8 | — | Unverified |
| ImageNet | CloFormer-XXS | Top-1 Accuracy (%) | 77.0 | — | Unverified |
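Since the Verified column is still empty, below is a minimal sketch of how the claimed Top-1 numbers could be checked on the ImageNet validation set. The `evaluate` helper and the standard 224x224 center-crop preprocessing are assumptions; the model itself must be constructed and its checkpoint loaded via the repository's own code, whose API is not documented here.

```python
import torch
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

def evaluate(model, val_dir, batch_size=128, device="cuda"):
    """Compute Top-1 accuracy (%) of `model` on an ImageNet-style val folder.
    Hypothetical verification harness, not part of the CloFormer repo."""
    tf = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    loader = DataLoader(ImageFolder(val_dir, tf), batch_size=batch_size,
                        num_workers=8, pin_memory=True)
    model.eval().to(device)
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            logits = model(images.to(device))
            correct += (logits.argmax(1).cpu() == labels).sum().item()
            total += labels.numel()
    return 100.0 * correct / total
```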