SOTAVerified

MaskCLIP++: A Mask-Based CLIP Fine-tuning Framework for Open-Vocabulary Image Segmentation

2024-12-16 · Code Available

Quan-Sheng Zeng, Yunheng Li, Daquan Zhou, Guanbin Li, Qibin Hou, Ming-Ming Cheng


Abstract

Open-vocabulary image segmentation has advanced through the synergy between mask generators and vision-language models like Contrastive Language-Image Pre-training (CLIP). Previous approaches focus on generating masks while aligning mask features with text embeddings during training. In this paper, we observe that relying on low-quality generated masks can weaken the alignment between vision and language in regional representations. This motivates us to present a new fine-tuning framework, named MaskCLIP++, which uses ground-truth masks instead of generated masks to enhance the mask classification capability of CLIP. Because image segmentation datasets with mask annotations offer limited category diversity, we incorporate a consistency alignment constraint during fine-tuning, which alleviates categorical bias toward the fine-tuning dataset. After low-cost fine-tuning, when combined with the mask generators of previous state-of-the-art mask-based open-vocabulary segmentation methods, we achieve performance improvements of +1.7, +2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20 datasets, respectively. Code is released at https://github.com/HVision-NKU/MaskCLIPpp .
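To make the mask-classification idea in the abstract concrete, here is a minimal sketch (with dummy NumPy arrays, not the paper's actual implementation; the function names `mask_pool` and `classify_mask` and all shapes are hypothetical): dense image features are average-pooled over a ground-truth binary mask, and the pooled region embedding is scored against CLIP-style text embeddings by cosine similarity.

```python
import numpy as np

def mask_pool(features, mask):
    """Average-pool dense features over a binary mask region.

    features: (H, W, D) dense image features (e.g. from a CLIP vision encoder).
    mask:     (H, W) binary ground-truth mask for one region.
    """
    weights = mask.astype(features.dtype)
    # Weighted mean over the masked pixels; epsilon guards an empty mask.
    return (features * weights[..., None]).sum(axis=(0, 1)) / max(weights.sum(), 1e-6)

def classify_mask(features, mask, text_embeds):
    """Classify a masked region by cosine similarity to text embeddings.

    text_embeds: (C, D), one row per category name.
    Returns the index of the best-matching category.
    """
    v = mask_pool(features, mask)
    v = v / np.linalg.norm(v)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return int(np.argmax(t @ v))

# Toy example: features inside the mask point along category 1's embedding.
features = np.zeros((4, 4, 3))
features[0:2, 0:2] = [0.0, 1.0, 0.0]
mask = np.zeros((4, 4))
mask[0:2, 0:2] = 1
pred = classify_mask(features, mask, np.eye(3))  # → 1
```

Fine-tuning on ground-truth masks, as the paper proposes, would supervise this classification step directly, rather than letting noisy generated masks corrupt the pooled region features.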

Tasks

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| ADE20K-150 | MaskCLIP++ | mIoU | 38.2 | — | Unverified |
| ADE20K-847 | MaskCLIP++ | mIoU | 16.8 | — | Unverified |
| PASCAL Context-459 | MaskCLIP++ | mIoU | 23.9 | — | Unverified |
| PASCAL Context-59 | MaskCLIP++ | mIoU | 62.5 | — | Unverified |
| PascalVOC-20 | MaskCLIP++ | mIoU | 96.8 | — | Unverified |

Reproductions