DVIS++: Improved Decoupled Framework for Universal Video Segmentation

2023-12-20

Tao Zhang, Xingye Tian, Yikang Zhou, Shunping Ji, Xuebo Wang, Xin Tao, Yuan Zhang, Pengfei Wan, Zhongyuan Wang, Yu Wu

Code

github.com/zhang-tao-whu/DVIS_Plus
Official · In paper · PyTorch · ★ 137

Abstract

We present the Decoupled VIdeo Segmentation (DVIS) framework, a novel approach for the challenging task of universal video segmentation, including video instance segmentation (VIS), video semantic segmentation (VSS), and video panoptic segmentation (VPS). Unlike previous methods that model video segmentation in an end-to-end manner, our approach decouples video segmentation into three cascaded sub-tasks: segmentation, tracking, and refinement. This decoupling design allows for simpler and more effective modeling of the spatio-temporal representations of objects, especially in complex scenes and long videos. Accordingly, we introduce two novel components: the referring tracker and the temporal refiner. These components track objects frame by frame and model spatio-temporal representations based on pre-aligned features. To improve the tracking capability of DVIS, we propose a denoising training strategy and introduce contrastive learning, resulting in a more robust framework named DVIS++. Furthermore, we evaluate DVIS++ in various settings, including the open-vocabulary setting and the use of a frozen pre-trained backbone. By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework. We conduct extensive experiments on six mainstream benchmarks, including the VIS, VSS, and VPS datasets. Using a unified architecture, DVIS++ significantly outperforms state-of-the-art specialized methods on these benchmarks in both closed- and open-vocabulary settings. Code: https://github.com/zhang-tao-whu/DVIS_Plus.
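
The three cascaded sub-tasks named in the abstract (segmentation, tracking, refinement) can be illustrated with a toy sketch. This is not the DVIS++ implementation: every function name below is hypothetical, object queries are plain lists of floats, the tracker is a brute-force similarity matching standing in for the learned referring tracker, and the refiner is a temporal mean standing in for the learned temporal refiner. The sketch also assumes every frame contains the same number of objects.

```python
from itertools import permutations
from typing import List

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def segment(frame: List[Vec]) -> List[Vec]:
    """Stage 1 (stand-in): a real image segmenter would emit one query
    embedding per object. Here a 'frame' is already a list of toy
    embeddings in arbitrary order, so we just copy it."""
    return [list(q) for q in frame]


def track(per_frame: List[List[Vec]]) -> List[List[Vec]]:
    """Stage 2 (stand-in): reorder each frame's queries so slot k follows
    the same object as slot k in the previous frame. Uses exhaustive
    permutation matching on dot-product similarity, NOT a learned tracker."""
    aligned = [per_frame[0]]
    for queries in per_frame[1:]:
        prev = aligned[-1]
        best = max(
            permutations(range(len(queries))),
            key=lambda p: sum(dot(prev[i], queries[j]) for i, j in enumerate(p)),
        )
        aligned.append([queries[j] for j in best])
    return aligned


def refine(aligned: List[List[Vec]]) -> List[Vec]:
    """Stage 3 (stand-in): build one spatio-temporal representation per
    object by averaging its pre-aligned queries over time."""
    n_frames = len(aligned)
    n_obj = len(aligned[0])
    dim = len(aligned[0][0])
    return [
        [sum(aligned[t][k][d] for t in range(n_frames)) / n_frames for d in range(dim)]
        for k in range(n_obj)
    ]


def run_pipeline(video: List[List[Vec]]) -> List[Vec]:
    """Cascade the three stages: segment each frame, align, then refine."""
    return refine(track([segment(f) for f in video]))
```

The point of the sketch is the ordering: refinement only sees features that tracking has already aligned, which is what lets the last stage stay a simple per-slot temporal operation.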

Tasks

Contrastive Learning, Denoising, Instance Segmentation, Panoptic Segmentation, Segmentation, Semantic Segmentation, Video Instance Segmentation, Video Panoptic Segmentation, Video Segmentation, Video Semantic Segmentation

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
OVIS validation	DVIS++(R50, Offline)	mask AP	41.2	—	Unverified
OVIS validation	DVIS++(R50, Online)	mask AP	37.2	—	Unverified
OVIS validation	DVIS++(ViT-L, Offline)	mask AP	53.4	—	Unverified
OVIS validation	DVIS++(ViT-L, Online)	mask AP	49.6	—	Unverified
YouTube-VIS 2021	DVIS++(ViT-L, Offline)	mask AP	63.9	—	Unverified
YouTube-VIS 2021	DVIS++(ViT-L, Online)	mask AP	62.3	—	Unverified
YouTube-VIS 2022 validation	DVIS++(ViT-L)	mAP_L	50.9	—	Unverified
YouTube-VIS validation	DVIS++(ViT-L, Online)	mask AP	67.7	—	Unverified
