XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
Ho Kei Cheng, Alexander G. Schwing
Code
- github.com/hkchengrex/XMem (official, PyTorch, ★ 1,962)
- github.com/tianyuan168326/videosemanticcompression-pytorch (PyTorch, ★ 37)
Abstract
We present XMem, a video object segmentation architecture for long videos with unified feature memory stores inspired by the Atkinson-Shiffrin memory model. Prior work on video object segmentation typically only uses one type of feature memory. For videos longer than a minute, a single feature memory model tightly links memory consumption and accuracy. In contrast, following the Atkinson-Shiffrin model, we develop an architecture that incorporates multiple independent yet deeply-connected feature memory stores: a rapidly updated sensory memory, a high-resolution working memory, and a compact thus sustained long-term memory. Crucially, we develop a memory potentiation algorithm that routinely consolidates actively used working memory elements into the long-term memory, which avoids memory explosion and minimizes performance decay for long-term prediction. Combined with a new memory reading mechanism, XMem greatly exceeds state-of-the-art performance on long-video datasets while being on par with state-of-the-art methods (that do not work on long videos) on short-video datasets. Code is available at https://hkchengrex.github.io/XMem
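The consolidation idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class names, capacities, nearest-key lookup, and the "promote the most-used items" rule are all assumptions chosen for illustration. The real system operates on feature maps with attention-based reading. The sketch shows only why usage-based consolidation keeps total memory bounded as the video grows.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    key: tuple          # hypothetical feature key (a small vector)
    usage: float = 0.0  # how often this item was matched by a read


@dataclass
class ToyMemory:
    # All capacities are illustrative assumptions, not values from the paper.
    work_capacity: int = 4   # working-memory size that triggers consolidation
    keep_top: int = 2        # items promoted per consolidation
    long_capacity: int = 6   # hard bound on long-term memory size
    working: list = field(default_factory=list)
    long_term: list = field(default_factory=list)

    def read(self, query):
        """Return the closest stored key and credit its usage counter."""
        items = self.working + self.long_term
        if not items:
            return None
        best = min(
            items,
            key=lambda it: sum((a - b) ** 2 for a, b in zip(it.key, query)),
        )
        best.usage += 1.0
        return best.key

    def write(self, key):
        """Store a new item; consolidate once working memory overflows."""
        self.working.append(MemoryItem(tuple(key)))
        if len(self.working) > self.work_capacity:
            self.consolidate()

    def consolidate(self):
        """Promote the most-used working items and discard the rest."""
        ranked = sorted(self.working, key=lambda it: it.usage, reverse=True)
        self.long_term.extend(ranked[: self.keep_top])
        self.working = []
        # Keep long-term memory bounded by evicting its least-used items.
        self.long_term.sort(key=lambda it: it.usage, reverse=True)
        del self.long_term[self.long_capacity:]


# Simulate a 20-frame video: one write and one read per frame.
mem = ToyMemory()
for t in range(20):
    mem.write((float(t), 0.0))
    mem.read((float(t), 0.0))

print(len(mem.working), len(mem.long_term))
```

However many frames are processed, the two stores together never hold more than `work_capacity + long_capacity` items, which is the property the abstract describes as avoiding memory explosion.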
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| DAVIS 2016 | XMem (DAVIS only) | J&F | 87.8 | — | Unverified |
| DAVIS 2016 | XMem (DAVIS and YouTubeVOS only) | J&F | 90.8 | — | Unverified |
| DAVIS 2016 | XMem (BL30K) | J&F | 92.0 | — | Unverified |
| DAVIS 2016 | XMem (MS) | J&F | 92.7 | — | Unverified |
| DAVIS-2017 (test-dev) | XMem (DAVIS and YouTubeVOS only) | J&F | 79.8 | — | Unverified |
| DAVIS-2017 (test-dev) | XMem (MS) | J&F | 83.1 | — | Unverified |
| DAVIS-2017 (test-dev) | XMem (BL30K, MS) | J&F | 83.7 | — | Unverified |
| DAVIS-2017 (test-dev) | XMem (BL30K, 600p) | J&F | 82.5 | — | Unverified |
| DAVIS-2017 (test-dev) | XMem (BL30K) | J&F | 81.2 | — | Unverified |
| DAVIS-2017 (test-dev) | XMem | J&F | 81.0 | — | Unverified |
| DAVIS 2017 (val) | XMem | J&F | 86.2 | — | Unverified |
| DAVIS 2017 (val) | XMem (BL30K, MS) | J&F | 89.5 | — | Unverified |
| DAVIS 2017 (val) | XMem (MS) | J&F | 88.2 | — | Unverified |
| DAVIS 2017 (val) | XMem (BL30K) | J&F | 87.7 | — | Unverified |
| DAVIS 2017 (val) | XMem (DAVIS and YouTubeVOS only) | J&F | 84.5 | — | Unverified |
| DAVIS 2017 (val) | XMem (DAVIS only) | J&F | 76.7 | — | Unverified |
| DAVIS (no YouTube-VOS training) | XMem | FPS | 29.6 | — | Unverified |
| MOSE | XMem | J&F | 57.6 | — | Unverified |
| YouTube-VOS 2018 | XMem | Overall | 85.7 | — | Unverified |
| YouTube-VOS 2018 | XMem (YouTubeVOS only) | Overall | 84.4 | — | Unverified |
| YouTube-VOS 2018 | XMem (MS) | Overall | 86.7 | — | Unverified |
| YouTube-VOS 2018 | XMem (BL30K) | Overall | 86.1 | — | Unverified |
| YouTube-VOS 2018 | XMem (BL30K, MS) | Overall | 86.9 | — | Unverified |
| YouTube-VOS 2019 | XMem (BL30K) | Overall | 85.8 | — | Unverified |
| YouTube-VOS 2019 | XMem | Overall | 84.3 | — | Unverified |
| YouTube-VOS 2019 | XMem (BL30K, MS) | Overall | 86.8 | — | Unverified |
| YouTube-VOS 2019 | XMem (MS) | Overall | 86.4 | — | Unverified |
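
For readers unfamiliar with the J&F metric in the table: J is the region similarity (intersection over union between the predicted and ground-truth masks), F is a boundary accuracy F-measure, and J&F is their mean. The sketch below computes J for one pair of tiny binary masks and averages it with a placeholder F value. The masks and the F value are made up for illustration, the boundary measure itself is not implemented, and official benchmark scores are additionally averaged over objects and frames.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks.

    Masks are equally sized lists of rows containing 0 or 1.
    """
    pairs = [(p, g) for row_p, row_g in zip(pred, gt) for p, g in zip(row_p, row_g)]
    inter = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F is the arithmetic mean of region similarity and boundary accuracy."""
    return (j + f) / 2.0


# Hypothetical 2x3 masks, not taken from any dataset.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
gt_mask = [[1, 0, 0],
           [0, 1, 1]]

j_score = jaccard(pred_mask, gt_mask)  # 2 overlapping pixels / 4 in the union
f_score = 0.75                         # placeholder boundary score (assumption)
print(j_score, j_and_f(j_score, f_score))
```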