Associating Objects with Transformers for Video Object Segmentation

2021-06-04 · NeurIPS 2021 · Code Available

Zongxin Yang, Yunchao Wei, Yi Yang

Abstract

This paper investigates how to realize better and more efficient embedding learning to tackle semi-supervised video object segmentation under challenging multi-object scenarios. The state-of-the-art methods learn to decode features with a single positive object and thus have to match and segment each target separately under multi-object scenarios, consuming several times the computing resources. To solve the problem, we propose an Associating Objects with Transformers (AOT) approach to match and decode multiple objects uniformly. In detail, AOT employs an identification mechanism to associate multiple targets into the same high-dimensional embedding space. Thus, we can simultaneously process multiple objects' matching and segmentation decoding as efficiently as processing a single object. To sufficiently model multi-object association, a Long Short-Term Transformer is designed for constructing hierarchical matching and propagation. We conduct extensive experiments on both multi-object and single-object benchmarks to examine AOT variant networks with different complexities. In particular, our R50-AOT-L outperforms all the state-of-the-art competitors on three popular benchmarks, i.e., YouTube-VOS (84.1% J&F), DAVIS 2017 (84.9%), and DAVIS 2016 (91.1%), while running more than 3× faster in multi-object scenarios. Meanwhile, our AOT-T can maintain real-time multi-object speed on the above benchmarks. Based on AOT, we ranked 1st in the 3rd Large-scale VOS Challenge.
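
The abstract does not spell out how the identification mechanism works, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each object is tagged with a distinct vector drawn from a shared identity bank, and that the per-object masks are then collapsed into one embedding map that a single forward pass could consume. The function name `assign_identities`, the array layouts, and the random (rather than learned) bank are all hypothetical choices made for illustration.

```python
import numpy as np


def assign_identities(masks, id_bank, rng):
    """Embed all object masks into one shared map (illustrative sketch).

    masks:   (N, H, W) one-hot masks, one per object (assumed layout)
    id_bank: (M, C) identity vectors with M >= N (random here; a real
             model would presumably learn them)
    returns: (H, W, C) embedding and the bank rows chosen per object
    """
    n_objects = masks.shape[0]
    # Give each object a distinct, randomly chosen identity vector.
    chosen = rng.choice(id_bank.shape[0], size=n_objects, replace=False)
    # Sum over objects: each pixel receives the vector of the object
    # covering it, and background pixels stay at zero.
    embedding = np.einsum("nhw,nc->hwc", masks, id_bank[chosen])
    return embedding, chosen


rng = np.random.default_rng(0)
id_bank = rng.standard_normal((10, 4))  # hypothetical bank: 10 identities, 4 dims
masks = np.zeros((2, 3, 3))
masks[0, 0, :] = 1.0  # object 0 covers the top row
masks[1, 2, :] = 1.0  # object 1 covers the bottom row
embedding, chosen = assign_identities(masks, id_bank, rng)
```

In this toy setup the cost of building the embedding does not grow with a per-object network pass: both objects share one `(H, W, C)` map, which is the property the abstract attributes to processing multiple objects "as efficiently as processing a single object".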

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| DAVIS 2016 | AOT-T | J&F | 86.8 | | Unverified |
| DAVIS 2016 | AOT-S | J&F | 89.4 | | Unverified |
| DAVIS 2016 | AOT-L | J&F | 90.4 | | Unverified |
| DAVIS 2016 | R50-AOT-L | J&F | 91.1 | | Unverified |
| DAVIS 2016 | AOT-L | J&F | 89.9 | | Unverified |
| DAVIS 2016 | SwinB-AOT-L | J&F | 92 | | Unverified |
| DAVIS-2017 (test-dev) | AOT-L | J&F | 78.3 | | Unverified |
| DAVIS-2017 (test-dev) | AOT-T | J&F | 72 | | Unverified |
| DAVIS-2017 (test-dev) | AOT-S | J&F | 73.9 | | Unverified |
| DAVIS-2017 (test-dev) | AOT-B | J&F | 75.5 | | Unverified |
| DAVIS-2017 (test-dev) | SwinB-AOT-L | J&F | 81.2 | | Unverified |
| DAVIS-2017 (test-dev) | R50-AOT-L | J&F | 79.6 | | Unverified |
| DAVIS 2017 (val) | AOT-T | J&F | 79.9 | | Unverified |
| DAVIS 2017 (val) | SwinB-AOT-L | J&F | 85.4 | | Unverified |
| DAVIS 2017 (val) | R50-AOT-L | J&F | 84.9 | | Unverified |
| DAVIS 2017 (val) | AOT-L | J&F | 83.8 | | Unverified |
| DAVIS 2017 (val) | AOT-B | J&F | 82.5 | | Unverified |
| DAVIS 2017 (val) | AOT-S | J&F | 81.3 | | Unverified |
| DAVIS (no YouTube-VOS training) | AOT-S | D17 val (G) | 79.2 | | Unverified |
| MOSE | AOT | J&F | 57.2 | | Unverified |
| VOT2020 | AOT-B | EAO | 0.54 | | Unverified |
| VOT2020 | R50-AOT-L | EAO | 0.57 | | Unverified |
| VOT2020 | AOT-L | EAO | 0.57 | | Unverified |
| VOT2020 | SwinB-AOT-L | EAO | 0.59 | | Unverified |
| VOT2020 | AOT-T | EAO | 0.44 | | Unverified |
| VOT2020 | AOT-S | EAO | 0.51 | | Unverified |
| YouTube-VOS 2018 | SwinB-AOT-L (all frames) | Overall | 85.1 | | Unverified |
| YouTube-VOS 2018 | AOT-S | Overall | 82.6 | | Unverified |
| YouTube-VOS 2018 | AOT-T (all frames) | Overall | 80.9 | | Unverified |
| YouTube-VOS 2018 | AOT-T | Overall | 80.2 | | Unverified |
| YouTube-VOS 2018 | R50-AOT-L (all frames) | Overall | 85.5 | | Unverified |
| YouTube-VOS 2018 | SwinB-AOT-L | Overall | 84.5 | | Unverified |
| YouTube-VOS 2018 | AOT-L (all frames) | Overall | 84.5 | | Unverified |
| YouTube-VOS 2018 | R50-AOT-L | Overall | 84.1 | | Unverified |
| YouTube-VOS 2018 | AOT-B (all frames) | Overall | 84.1 | | Unverified |
| YouTube-VOS 2018 | AOT-L | Overall | 83.8 | | Unverified |
| YouTube-VOS 2018 | AOT-B | Overall | 83.5 | | Unverified |
| YouTube-VOS 2018 | AOT-S (all frames) | Overall | 83 | | Unverified |

Reproductions