Per-Pixel Classification is Not All You Need for Semantic Segmentation
Bowen Cheng, Alexander G. Schwing, Alexander Kirillov
Code
- github.com/facebookresearch/MaskFormer (official, PyTorch, ★ 1,453)
- github.com/huggingface/transformers (PyTorch, ★ 158,292)
- github.com/open-mmlab/mmdetection (PyTorch, ★ 32,525)
Abstract
Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
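To make the mask classification idea concrete, the following is a minimal pure-Python sketch of how a set of binary mask predictions, each paired with one global class prediction, could be turned into a per-pixel semantic map. The combination rule used here (weight each mask probability by its class probability, sum over masks, take the per-pixel argmax, and ignore a trailing "no object" class) is an assumption for illustration and is not stated in the abstract. The function name `semantic_from_masks` and the toy inputs are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def semantic_from_masks(class_logits, mask_logits):
    """Combine N (class, mask) predictions into an H x W label map.

    class_logits: N rows of K + 1 logits; the last entry is assumed to be
                  a "no object" class and is dropped after the softmax.
    mask_logits:  N masks, each an H x W grid of logits.
    """
    num_classes = len(class_logits[0]) - 1
    class_probs = [softmax(row)[:num_classes] for row in class_logits]
    height, width = len(mask_logits[0]), len(mask_logits[0][0])

    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Per-class score: sum over masks of p(class) * p(pixel in mask).
            scores = [0.0] * num_classes
            for probs, mask in zip(class_probs, mask_logits):
                mask_prob = sigmoid(mask[y][x])
                for c in range(num_classes):
                    scores[c] += probs[c] * mask_prob
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels


# Toy input: 2 predictions, 2 classes plus "no object", a 1 x 2 image.
toy_class_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
toy_mask_logits = [[[4.0, -4.0]], [[-4.0, 4.0]]]
toy_labels = semantic_from_masks(toy_class_logits, toy_mask_logits)
print(toy_labels)
```

In the toy input, the first prediction is confident about class 0 and covers the left pixel, while the second is confident about class 1 and covers the right pixel, so the resulting label map assigns class 0 to the left pixel and class 1 to the right one. The released implementations listed above operate on batched tensors rather than nested lists.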
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| ADE20K val | MaskFormer (R101 + 6 Enc) | PQ | 35.7 | — | Unverified |
| COCO minival | MaskFormer (single-scale) | PQ | 52.7 | — | Unverified |
| COCO test-dev | MaskFormer (Swin-L) | PQ | 53.3 | — | Unverified |