
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window

2022-08-24

Mocho Go, Hideyuki Tachibana


Abstract

Following its success in the language domain, the self-attention mechanism (the Transformer) has been adopted in the vision domain, where it has recently achieved great success. In parallel, the multi-layer perceptron (MLP) has also been explored for vision. These architectures, as alternatives to traditional CNNs, have attracted attention recently, and many methods have been proposed. To combine parameter efficiency and performance with locality and hierarchy in image recognition, we propose gSwin, which merges two streams: the Swin Transformer and (multi-head) gMLP. We show that gSwin achieves better accuracy than the Swin Transformer on three vision tasks (image classification, object detection, and semantic segmentation) with a smaller model size.
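The abstract describes gSwin as replacing Swin's windowed self-attention with gMLP-style blocks applied inside (shifted) local windows. The sketch below is a heavily simplified illustration of that idea, not the paper's implementation: all weight names and shapes are hypothetical, ReLU stands in for GELU, and the multi-head variant, window shifting, and normalization layers are omitted.

```python
import numpy as np

def spatial_gating_unit(h, w_token, b_token):
    """gMLP spatial gating: half the channels gate the other half
    after a learned linear mixing over the token (spatial) axis.

    h: (n_tokens, 2 * d) activations for one window.
    w_token: (n_tokens, n_tokens) token-mixing weights.
    b_token: (n_tokens, 1) token-mixing bias.
    """
    u, v = np.split(h, 2, axis=-1)      # each (n_tokens, d)
    v = w_token @ v + b_token           # mix information across tokens
    return u * v                        # element-wise gating

def gmlp_block_on_window(x, w_in, w_token, b_token, w_out):
    """One gMLP block applied to a single local window of tokens,
    in the spirit of gSwin's window-restricted gMLP.

    x: (n_tokens, c) window of token embeddings.
    """
    h = np.maximum(x @ w_in, 0.0)       # channel expansion (ReLU in place of GELU)
    h = spatial_gating_unit(h, w_token, b_token)
    return x + h @ w_out                # residual connection
```

Restricting the token-mixing matrix `w_token` to a small window (e.g. 7x7 = 49 tokens, as in Swin) keeps the parameter count independent of image size, which is one way a gMLP can inherit Swin's locality and hierarchy.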

Tasks

Benchmark Results

Dataset  | Model    | Metric         | Claimed | Verified | Status
ImageNet | gSwin-S  | Top 1 Accuracy | 83.01   |          | Unverified
ImageNet | gSwin-VT | Top 1 Accuracy | 80.32   |          | Unverified
ImageNet | gSwin-T  | Top 1 Accuracy | 81.71   |          | Unverified

Reproductions