Associating Objects with Transformers for Video Object Segmentation

2021-06-04 · NeurIPS 2021

Zongxin Yang, Yunchao Wei, Yi Yang

Code

github.com/z-x-yang/AOT
Official · In paper · paddle · ★ 146
github.com/yoxu515/aot-benchmark
pytorch · ★ 585

Abstract

This paper investigates how to learn better and more efficient embeddings for semi-supervised video object segmentation in challenging multi-object scenarios. State-of-the-art methods learn to decode features with a single positive object, so in multi-object scenarios they must match and segment each target separately, consuming several times the computing resources. To solve this problem, we propose Associating Objects with Transformers (AOT), an approach that matches and decodes multiple objects uniformly. In detail, AOT employs an identification mechanism to associate multiple targets into the same high-dimensional embedding space. Thus, we can process the matching and segmentation decoding of multiple objects simultaneously, as efficiently as processing a single object. To model multi-object association sufficiently, a Long Short-Term Transformer is designed for constructing hierarchical matching and propagation. We conduct extensive experiments on both multi-object and single-object benchmarks to examine AOT variant networks of different complexities. In particular, our R50-AOT-L outperforms all state-of-the-art competitors on three popular benchmarks, i.e., YouTube-VOS (84.1% J&F), DAVIS 2017 (84.9%), and DAVIS 2016 (91.1%), while running more than 3× faster in multi-object scenarios. Meanwhile, our AOT-T maintains real-time multi-object speed on the above benchmarks. Based on AOT, we ranked 1st in the 3rd Large-scale VOS Challenge.
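
The identification mechanism described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the bank size, and the use of random vectors (a real model would presumably learn them) are all assumptions made for illustration. The only point it shows is that every object's mask is written into one shared per-pixel embedding, so the size of that embedding does not grow with the number of objects.

```python
import random


def build_id_bank(num_ids, dim, seed=0):
    """Hypothetical identity bank: one vector per assignable object slot.

    Random values stand in for whatever a trained model would store.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_ids)]


def identification_embedding(label_map, id_bank):
    """Turn an integer label map (0 = background, k = object k) into a
    per-pixel embedding of shape (H, W, dim).

    Each pixel receives the bank vector of the object it belongs to, so all
    objects share a single embedding map regardless of how many there are.
    """
    return [[list(id_bank[label]) for label in row] for row in label_map]


# Toy 2x3 frame with background (0) and two objects (1 and 2).
label_map = [[0, 1, 1],
             [2, 2, 0]]
id_bank = build_id_bank(num_ids=3, dim=4)
embedding = identification_embedding(label_map, id_bank)
```

Adding a third object to `label_map` would change which bank vectors appear in `embedding` but not its shape, which is the property that lets a single matching and decoding pass cover all targets.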

Tasks

Object
One-shot visual object segmentation
Semantic Segmentation
Semi-Supervised Video Object Segmentation
Video Object Segmentation
Video Semantic Segmentation
Visual Object Tracking

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
DAVIS 2016	AOT-T	J&F	86.8	—	Unverified
DAVIS 2016	AOT-S	J&F	89.4	—	Unverified
DAVIS 2016	AOT-L	J&F	90.4	—	Unverified
DAVIS 2016	R50-AOT-L	J&F	91.1	—	Unverified
DAVIS 2016	AOT-L	J&F	89.9	—	Unverified
DAVIS 2016	SwinB-AOT-L	J&F	92	—	Unverified
DAVIS-2017 (test-dev)	AOT-L	J&F	78.3	—	Unverified
DAVIS-2017 (test-dev)	AOT-T	J&F	72	—	Unverified
DAVIS-2017 (test-dev)	AOT-S	J&F	73.9	—	Unverified
DAVIS-2017 (test-dev)	AOT-B	J&F	75.5	—	Unverified
DAVIS-2017 (test-dev)	SwinB-AOT-L	J&F	81.2	—	Unverified
DAVIS-2017 (test-dev)	R50-AOT-L	J&F	79.6	—	Unverified
DAVIS 2017 (val)	AOT-T	J&F	79.9	—	Unverified
DAVIS 2017 (val)	SwinB-AOT-L	J&F	85.4	—	Unverified
DAVIS 2017 (val)	R50-AOT-L	J&F	84.9	—	Unverified
DAVIS 2017 (val)	AOT-L	J&F	83.8	—	Unverified
DAVIS 2017 (val)	AOT-B	J&F	82.5	—	Unverified
DAVIS 2017 (val)	AOT-S	J&F	81.3	—	Unverified
DAVIS (no YouTube-VOS training)	AOT-S	D17 val (G)	79.2	—	Unverified
MOSE	AOT	J&F	57.2	—	Unverified
VOT2020	AOT-B	EAO	0.54	—	Unverified
VOT2020	R50-AOT-L	EAO	0.57	—	Unverified
VOT2020	AOT-L	EAO	0.57	—	Unverified
VOT2020	SwinB-AOT-L	EAO	0.59	—	Unverified
VOT2020	AOT-T	EAO	0.44	—	Unverified
VOT2020	AOT-S	EAO	0.51	—	Unverified
YouTube-VOS 2018	SwinB-AOT-L (all frames)	Overall	85.1	—	Unverified
YouTube-VOS 2018	AOT-S	Overall	82.6	—	Unverified
YouTube-VOS 2018	AOT-T (all frames)	Overall	80.9	—	Unverified
YouTube-VOS 2018	AOT-T	Overall	80.2	—	Unverified
YouTube-VOS 2018	R50-AOT-L (all frames)	Overall	85.5	—	Unverified
YouTube-VOS 2018	SwinB-AOT-L	Overall	84.5	—	Unverified
YouTube-VOS 2018	AOT-L (all frames)	Overall	84.5	—	Unverified
YouTube-VOS 2018	R50-AOT-L	Overall	84.1	—	Unverified
YouTube-VOS 2018	AOT-B (all frames)	Overall	84.1	—	Unverified
YouTube-VOS 2018	AOT-L	Overall	83.8	—	Unverified
YouTube-VOS 2018	AOT-B	Overall	83.5	—	Unverified
YouTube-VOS 2018	AOT-S (all frames)	Overall	83	—	Unverified
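
Many rows above report J&F, which in the DAVIS benchmarks is the mean of region similarity J (intersection over union of predicted and ground-truth masks) and contour accuracy F (a boundary F-measure). The sketch below computes J for one object in one frame and averages it with an F value supplied by the caller; the function names are hypothetical, the boundary measure itself is not implemented, and returning 1.0 when both masks are empty is an assumed convention.

```python
def region_similarity(pred, gt):
    """J: intersection over union of two binary masks given as rows of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_score, f_score):
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j_score + f_score) / 2.0


# Toy masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
j = region_similarity(pred, gt)
score = j_and_f(j, 0.75)  # 0.75 is a made-up F value for illustration
```

Benchmark numbers are obtained by averaging such per-frame, per-object scores over a whole dataset, so the toy values here are not comparable to the table.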
