NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation

2023-08-29

Tim Meinhardt, Matt Feiszli, Yuchen Fan, Laura Leal-Taixe, Rakesh Ranjan

Abstract

Until recently, the Video Instance Segmentation (VIS) community operated under the common belief that offline methods are generally superior to frame-by-frame online processing. However, the recent success of online methods calls this belief into question, particularly for challenging and long video sequences. We understand this work as a rebuttal of those recent observations and an appeal to the community to focus on dedicated near-online VIS approaches. To support our argument, we present a detailed analysis of different processing paradigms and the new end-to-end trainable NOVIS (Near-Online Video Instance Segmentation) method. Our transformer-based model directly predicts spatio-temporal mask volumes for clips of frames and performs instance tracking between clips via overlap embeddings. NOVIS represents the first near-online VIS approach that avoids any handcrafted tracking heuristics. We outperform all existing VIS methods by large margins and provide new state-of-the-art results on both the YouTube-VIS (2019/2021) and OVIS benchmarks.
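The three processing paradigms named in the abstract differ mainly in how many frames a model sees at once. The sketch below only illustrates that distinction by grouping frame indices into clips; it is not the NOVIS implementation. The function name `make_clips` and the parameters `clip_len` and `stride` are hypothetical, and the frame overlap between consecutive clips is an assumption made for illustration, since the abstract does not say how NOVIS arranges its clips.

```python
from typing import List


def make_clips(num_frames: int, clip_len: int, stride: int) -> List[List[int]]:
    """Group frame indices 0..num_frames-1 into clips (hypothetical helper).

    clip_len == 1           -> frame-by-frame (online) processing
    clip_len == num_frames  -> whole-video (offline) processing
    anything in between     -> short clips (near-online processing)

    Consecutive clips start `stride` frames apart, so they share
    clip_len - stride frames whenever stride < clip_len (an assumption
    for illustration, not a detail taken from the paper).
    """
    if num_frames <= 0 or clip_len <= 0 or stride <= 0:
        raise ValueError("num_frames, clip_len and stride must be positive")
    if stride > clip_len:
        raise ValueError("stride must not exceed clip_len, or frames are skipped")

    clips: List[List[int]] = []
    start = 0
    while True:
        end = min(start + clip_len, num_frames)
        clips.append(list(range(start, end)))
        if end == num_frames:  # the last frame has been covered
            break
        start += stride
    return clips


if __name__ == "__main__":
    print(make_clips(6, 1, 1))  # online: one frame per clip
    print(make_clips(6, 3, 2))  # near-online: short clips sharing one frame
    print(make_clips(6, 6, 6))  # offline: the whole video as one clip
```

In a near-online pipeline, a model would run once per clip and then link the instances predicted in one clip to those in the next; that linking step is what the abstract attributes to overlap embeddings, and it is deliberately left out of this sketch.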

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| OVIS validation | NOVIS (Swin-L) | mask AP | 43.5 | | Unverified |
| OVIS validation | NOVIS (ResNet-50) | mask AP | 32.7 | | Unverified |
| YouTube-VIS 2021 | NOVIS (Swin-L) | mask AP | 59.8 | | Unverified |
| YouTube-VIS 2021 | NOVIS (ResNet-50) | mask AP | 47.2 | | Unverified |
| YouTube-VIS validation | NOVIS (ResNet-50) | mask AP | 52.8 | | Unverified |