Temporally Consistent Referring Video Object Segmentation with Hybrid Memory

2024-03-28 · Code Available

Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Mubarak Shah, Ajmal Mian


Code

github.com/bo-miao/HTR
Official · In paper · PyTorch · ★ 19

Abstract

Referring Video Object Segmentation (R-VOS) methods face challenges in maintaining consistent object segmentation due to temporal context variability and the presence of other visually similar objects. We propose an end-to-end R-VOS paradigm that explicitly models temporal instance consistency alongside the referring segmentation. Specifically, we introduce a novel hybrid memory that facilitates inter-frame collaboration for robust spatio-temporal matching and propagation. Features of frames with automatically generated high-quality reference masks are propagated to segment the remaining frames based on multi-granularity association to achieve temporally consistent R-VOS. Furthermore, we propose a new Mask Consistency Score (MCS) metric to evaluate the temporal consistency of video segmentation. Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin, leading to top-ranked performance on popular R-VOS benchmarks, i.e., Ref-YouTube-VOS (67.1%) and Ref-DAVIS17 (65.6%). The code is available at https://github.com/bo-miao/HTR.
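The abstract describes propagating features from frames that have high-quality reference masks to the remaining frames. Memory-based video object segmentation commonly does this with a space-time memory readout, as in STM. The sketch below shows only that generic mechanism and assumes dot-product affinity with a softmax over memory locations. It is not the hybrid memory or the multi-granularity association used by HTR, whose details are in the paper and the official repository. The function `memory_readout` and its argument names are hypothetical.

```python
import numpy as np


def memory_readout(query_key, memory_keys, memory_values):
    """Generic STM-style memory readout (illustrative; NOT HTR's hybrid memory).

    query_key:     (C_k, N_q) key features of the frame being segmented
    memory_keys:   (C_k, N_m) keys of reference frames, flattened over T*H*W
    memory_values: (C_v, N_m) values (mask-aware features) of reference frames
    Returns:       (C_v, N_q) features propagated to every query location
    """
    c_k = query_key.shape[0]
    # Scaled dot-product affinity between every memory and query location.
    affinity = memory_keys.T @ query_key / np.sqrt(c_k)  # (N_m, N_q)
    # Softmax over memory locations; subtract the max for numerical stability.
    affinity = affinity - affinity.max(axis=0, keepdims=True)
    weights = np.exp(affinity)
    weights /= weights.sum(axis=0, keepdims=True)
    # Each query location receives a convex combination of memory values.
    return memory_values @ weights  # (C_v, N_q)


# Usage: two 4x4 reference frames in memory, one 4x4 query frame.
rng = np.random.default_rng(0)
T, H, W, C_k, C_v = 2, 4, 4, 8, 32
mem_k = rng.standard_normal((C_k, T * H * W))
mem_v = rng.standard_normal((C_v, T * H * W))
q_k = rng.standard_normal((C_k, H * W))
out = memory_readout(q_k, mem_k, mem_v)
print(out.shape)  # (32, 16)
```

Because the weights for each query location sum to one, every readout feature stays within the range of the stored memory values; an R-VOS model would decode these propagated features into a mask for the query frame.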

Tasks

HTR, Object, Referring Expression Segmentation, Referring Video Object Segmentation, Segmentation, Semantic Segmentation, Video Object Segmentation, Video Segmentation, Video Semantic Segmentation

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
DAVIS 2017 (val)	HTR	J&F 1st frame	65.6	—	Unverified
Refer-YouTube-VOS (2021 public validation)	HTR (Pre-training)	J&F	67.1	—	Unverified
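Both results are reported as J&F, the mean of region similarity J (the Jaccard index, i.e. mask IoU) and contour accuracy F (a boundary F-measure). The sketch below computes only J for binary masks and averages it over the frames of one object. F requires boundary extraction with a distance tolerance and is omitted, so this is not a substitute for the official DAVIS or Ref-YouTube-VOS evaluation code. The function names are hypothetical.

```python
import numpy as np


def region_similarity(pred, gt):
    """Jaccard index J = |pred AND gt| / |pred OR gt| for one binary mask pair."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: counted as a perfect match (DAVIS toolkit convention).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def mean_region_similarity(pred_video, gt_video):
    """Average J over the frames of one object (sequences of H x W masks)."""
    scores = [region_similarity(p, g) for p, g in zip(pred_video, gt_video)]
    return float(np.mean(scores))


# Usage: the prediction covers the 4 ground-truth pixels plus 2 extra ones.
gt = np.zeros((4, 4), dtype=bool)
gt[:2, :2] = True
pred = np.zeros((4, 4), dtype=bool)
pred[:2, :3] = True
print(round(region_similarity(pred, gt), 3))  # 0.667
```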
