
BootsTAP: Bootstrapped Training for Tracking-Any-Point

2024-02-01 · Code Available

Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, Andrew Zisserman


Abstract

To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently has a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%. For visualizations, see our project webpage at https://bootstap.github.io/
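
The core idea is a self-distillation loop: a teacher copy of the tracker produces pseudo-labels on unlabeled real video, and a student is trained to reproduce them on a perturbed view of the same clip. The sketch below is not the authors' code; it only illustrates one way such a loss could look in JAX. The `tracker_apply` callable, its `(params, video, query_points) -> (tracks, visibility_logits)` signature, and the photometric-only augmentation are all assumptions made for illustration.

```python
import jax
import jax.numpy as jnp


def bootstrap_loss(student_params, teacher_params, tracker_apply,
                   video, augmented_video, query_points,
                   confidence_threshold=0.5):
    """Hypothetical student-teacher pseudo-label loss on an unlabeled clip.

    tracker_apply(params, video, query_points) is assumed to return
    (tracks [Q, T, 2], visibility_logits [Q, T]).
    """
    # Teacher pseudo-labels on the original clip; no gradient flows back.
    teacher_tracks, teacher_vis_logits = tracker_apply(
        teacher_params, video, query_points)
    teacher_tracks = jax.lax.stop_gradient(teacher_tracks)
    teacher_visible = (
        jax.nn.sigmoid(jax.lax.stop_gradient(teacher_vis_logits))
        > confidence_threshold)

    # Student predictions on an augmented view of the same clip. A purely
    # photometric augmentation keeps coordinates directly comparable; a
    # geometric (e.g. affine) augmentation would also require remapping the
    # teacher's coordinates into the student's frame.
    student_tracks, student_vis_logits = tracker_apply(
        student_params, augmented_video, query_points)

    # Position loss, counted only where the teacher is confident the point
    # is visible.
    mask = teacher_visible.astype(jnp.float32)
    position_error = jnp.sum((student_tracks - teacher_tracks) ** 2, axis=-1)
    loss_position = jnp.sum(position_error * mask) / (jnp.sum(mask) + 1e-8)

    # Visibility loss: binary cross-entropy against the teacher's hard
    # visibility decisions.
    labels = teacher_visible.astype(jnp.float32)
    loss_visibility = -jnp.mean(
        labels * jax.nn.log_sigmoid(student_vis_logits)
        + (1.0 - labels) * jax.nn.log_sigmoid(-student_vis_logits))

    return loss_position + loss_visibility
```

In this kind of setup the teacher is typically a frozen or slowly updated (e.g. exponential-moving-average) copy of the student, so the student cannot trivially collapse onto its own mistakes.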

Tasks

Benchmark Results

Dataset              | Model      | Metric          | Claimed | Verified | Status
TAP-Vid-DAVIS        | BootsTAPIR | Average Jaccard | 66.2    |          | Unverified
TAP-Vid-Kinetics     | BootsTAPIR | Average Jaccard | 61.4    |          | Unverified
TAP-Vid-RGB-Stacking | BootsTAPIR | Average Jaccard | 72.4    |          | Unverified
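
The Average Jaccard metric reported above combines position accuracy with occlusion (visibility) prediction. As a rough reference, below is a minimal sketch of the standard TAP-Vid-style computation, assuming predicted and ground-truth tracks of shape [num_points, num_frames, 2] with boolean visibility flags; the function and argument names are illustrative, not taken from the benchmark code.

```python
import numpy as np


def average_jaccard(pred_xy, pred_visible, gt_xy, gt_visible,
                    thresholds=(1, 2, 4, 8, 16)):
    """pred_xy, gt_xy: [N, T, 2] pixel coordinates; *_visible: [N, T] bools."""
    dist = np.linalg.norm(np.asarray(pred_xy) - np.asarray(gt_xy), axis=-1)
    pred_visible = np.asarray(pred_visible, dtype=bool)
    gt_visible = np.asarray(gt_visible, dtype=bool)
    jaccards = []
    for thresh in thresholds:
        # A prediction is "correct" when the point is truly visible and the
        # predicted position lies within the pixel threshold.
        correct = gt_visible & (dist <= thresh)
        true_pos = np.sum(pred_visible & correct)
        false_pos = np.sum(pred_visible & ~correct)
        false_neg = np.sum(gt_visible & ~(pred_visible & (dist <= thresh)))
        jaccards.append(true_pos / max(true_pos + false_pos + false_neg, 1))
    # Average Jaccard averages over the 1, 2, 4, 8 and 16 pixel thresholds.
    return float(np.mean(jaccards))
```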

Reproductions