SOTAVerified

Training Vision Transformers for Semi-Supervised Semantic Segmentation

2024-01-01CVPR 2024Code Available1· sign in to hype

Xinting Hu, Li Jiang, Bernt Schiele

Code Available — Be the first to reproduce this paper.

Reproduce

Code

Abstract

We present S4Former a novel approach to training Vision Transformers for Semi-Supervised Semantic Segmentation (S4). At its core S4Former employs a Vision Transformer within a classic teacher-student framework and then leverages three novel technical ingredients: PatchShuffle as a parameter-free perturbation technique Patch-Adaptive Self-Attention (PASA) as a fine-grained feature modulation method and the innovative Negative Class Ranking (NCR) regularization loss. Based on these regularization modules aligned with Transformer-specific characteristics across the image input feature and output dimensions S4Former exploits the Transformer's ability to capture and differentiate consistent global contextual information in unlabeled images. Overall S4Former not only defines a new state of the art in S4 but also maintains a streamlined and scalable architecture. Being readily compatible with existing frameworks S4Former achieves strong improvements (up to 4.9%) on benchmarks like Pascal VOC 2012 COCO and Cityscapes with varying numbers of labeled data. The code is at https://github.com/JoyHuYY1412/S4Former.

Tasks

Reproductions