Semi-supervised Vision Transformers at Scale

2022-08-11

Zhaowei Cai, Avinash Ravichandran, Paolo Favaro, Manchen Wang, Davide Modolo, Rahul Bhotika, Zhuowen Tu, Stefano Soatto

Code

github.com/amazon-science/semi-vit
PyTorch, ★ 61

Abstract

We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of the ViT architectures to different tasks. To tackle this problem, we propose a new SSL pipeline, consisting of first un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning. At the semi-supervised fine-tuning stage, we adopt an exponential moving average (EMA)-Teacher framework instead of the popular FixMatch, since the former is more stable and delivers higher accuracy for semi-supervised vision transformers. In addition, we propose a probabilistic pseudo mixup mechanism to interpolate unlabeled samples and their pseudo labels for improved regularization, which is important for training ViTs with weak inductive bias. Our proposed method, dubbed Semi-ViT, achieves comparable or better performance than the CNN counterparts in the semi-supervised classification setting. Semi-ViT also enjoys the scalability benefits of ViTs that can be readily scaled up to large-size models with increasing accuracies. For example, Semi-ViT-Huge achieves an impressive 80% top-1 accuracy on ImageNet using only 1% labels, which is comparable with Inception-v4 using 100% ImageNet labels.
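
The two ingredients named in the abstract, an EMA-updated teacher and mixup over unlabeled samples and their pseudo labels, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the flat-list representation of weights and images, and the default values `momentum=0.999` and `alpha=0.8` are assumptions for illustration. The abstract also does not spell out the probabilistic rule of the pseudo mixup, so plain mixup with a Beta-sampled coefficient stands in for it here.

```python
import random


def ema_update(teacher, student, momentum=0.999):
    """Return teacher weights moved a small step toward the student weights.

    teacher, student: flat lists of floats (stand-ins for model parameters).
    momentum: assumed decay rate; values close to 1 make the teacher change slowly.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def pseudo_mixup(x_a, p_a, x_b, p_b, alpha=0.8, rng=random):
    """Interpolate two unlabeled samples and their pseudo labels.

    x_a, x_b: flat lists of floats (stand-ins for images).
    p_a, p_b: pseudo-label probability vectors predicted by the teacher.
    alpha: assumed Beta(alpha, alpha) parameter for the mixing coefficient.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]

    def mix(u, v):
        return [lam * a + (1.0 - lam) * b for a, b in zip(u, v)]

    return mix(x_a, x_b), mix(p_a, p_b), lam
```

In a training loop under these assumptions, the student would be trained on the mixed samples against the mixed pseudo labels, and `ema_update` would be called after each student optimizer step to refresh the teacher.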

Tasks

Inductive Bias, Semi-Supervised Image Classification

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
ImageNet - 10% labeled data	Semi-ViT (ViT-Huge)	Top 1 Accuracy	84.3	—	Unverified
ImageNet - 10% labeled data	Semi-ViT (ViT-Large)	Top 1 Accuracy	83.3	—	Unverified
ImageNet - 10% labeled data	Semi-ViT (ViT-Base)	Top 1 Accuracy	79.7	—	Unverified
ImageNet - 10% labeled data	Semi-ViT (ViT-Small)	Top 1 Accuracy	77.1	—	Unverified
ImageNet - 1% labeled data	Semi-ViT (ViT-Huge)	Top 1 Accuracy	80	—	Unverified
ImageNet - 1% labeled data	Semi-ViT (ViT-Large)	Top 1 Accuracy	77.3	—	Unverified
ImageNet - 1% labeled data	Semi-ViT (ViT-Base)	Top 1 Accuracy	71	—	Unverified
