SegViT: Semantic Segmentation with Plain Vision Transformers

2022-10-12 · Code Available

Bowen Zhang, Zhi Tian, Quan Tang, Xiangxiang Chu, Xiaolin Wei, Chunhua Shen, Yifan Liu

Abstract

We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose SegViT. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. In contrast, we make use of a fundamental component, the attention mechanism, to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks. Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on the COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to 40% of the computation while maintaining competitive performance.
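To make the ATM idea concrete, here is a minimal PyTorch sketch of an Attention-to-Mask style head, assuming patch tokens from a plain ViT backbone. It is an illustration, not the authors' implementation: the names (ATMHead, q_proj, k_proj) and the single-head, single-layer design are hypothetical simplifications of the paper's transformer decoder, which also predicts a class label from each class token.

```python
import torch
import torch.nn as nn

class ATMHead(nn.Module):
    """Sketch of an Attention-to-Mask (ATM) style head: the cross-attention
    similarity map between learnable class tokens (queries) and ViT patch
    tokens (keys) is reused directly as per-class segmentation masks."""

    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        # One learnable token per semantic class.
        self.class_tokens = nn.Parameter(torch.randn(num_classes, dim))
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) patch tokens, N = H * W spatial locations.
        b = feats.shape[0]
        q = self.q_proj(self.class_tokens).expand(b, -1, -1)  # (B, C, dim)
        k = self.k_proj(feats)                                 # (B, N, dim)
        # Similarity map between class tokens and spatial features.
        sim = torch.einsum("bcd,bnd->bcn", q, k) * self.scale  # (B, C, N)
        # "Transferred to segmentation masks": a sigmoid turns the
        # similarities into per-class soft masks.
        return sim.sigmoid()

# Example: ViT-Base features for a 14x14 patch grid, 150 ADE20K classes.
feats = torch.randn(2, 14 * 14, 768)
masks = ATMHead(num_classes=150, dim=768)(feats)  # (2, 150, 196)
masks = masks.reshape(2, 150, 14, 14)             # per-class mask maps
```

In the paper's full module, the same similarity map does double duty: softmaxed along the spatial axis it acts as ordinary cross-attention that updates the class tokens, while the sigmoid view yields the masks.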

Tasks

Semantic Segmentation

Benchmark Results

Dataset             | Model              | Metric | Claimed | Verified | Status
ADE20K val          | SegViT (ViT-Large) | mIoU   | 55.2    | –        | Unverified
COCO-Stuff-10K test | SegViT             | mIoU   | 50.3    | –        | Unverified
PASCAL-Context      | SegViT             | mIoU   | 65.3    | –        | Unverified

Reproductions

No reproductions yet. Be the first to reproduce this paper.