TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement
Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, Andrew Zisserman
Code
- github.com/deepmind/tapnet (official, in paper; JAX; ★ 1,820)
- github.com/riponazad/echotracker (PyTorch; ★ 56)
- github.com/ibaiGorordo/Tapir-Pytorch-Inference (PyTorch; ★ 18)
Abstract
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model facilitates fast inference on long and high-resolution video sequences. On a modern GPU, our implementation has the capacity to track points faster than real-time, and can be flexibly extended to higher-resolution videos. Given the high-quality trajectories extracted from a large dataset, we demonstrate a proof-of-concept diffusion model which generates trajectories from static images, enabling plausible animations. Visualizations, source code, and pretrained models can be found on our project webpage.
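The two stages described in the abstract can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`match_stage`, `refine_stage`, `track`), the hard argmax used for matching, and the softmax-weighted local average with a simple feature blend used for refinement are all hypothetical stand-ins for the learned components of the actual model. The sketch also refines each frame on its own, so it shows only the local-correlation part of refinement and leaves out any temporal modelling.

```python
import math


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def match_stage(query_feat, frame_feats):
    """Stage 1: pick the grid cell whose feature correlates best with the query.

    Each frame is handled on its own, with no temporal context.
    """
    best_score, best_pos = -math.inf, (0, 0)
    for r, row in enumerate(frame_feats):
        for c, feat in enumerate(row):
            score = dot(query_feat, feat)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos


def refine_stage(query_feat, frame_feats, pos, radius=1, blend=0.5):
    """Stage 2 stand-in: update position and query feature from local correlations."""
    h, w = len(frame_feats), len(frame_feats[0])
    r0, c0 = round(pos[0]), round(pos[1])
    # Correlate the query with every cell in a small window around the estimate.
    cells = [
        (r, c, dot(query_feat, frame_feats[r][c]))
        for r in range(max(0, r0 - radius), min(h, r0 + radius + 1))
        for c in range(max(0, c0 - radius), min(w, c0 + radius + 1))
    ]
    # Softmax over the window (max subtracted for numerical stability).
    top = max(score for _, _, score in cells)
    weights = [math.exp(score - top) for _, _, score in cells]
    total = sum(weights)
    new_r = sum(wt * r for wt, (r, _, _) in zip(weights, cells)) / total
    new_c = sum(wt * c for wt, (_, c, _) in zip(weights, cells)) / total
    # Assumed update rule: blend the query with the feature nearest the new position.
    local = frame_feats[round(new_r)][round(new_c)]
    new_query = [(1 - blend) * q + blend * f for q, f in zip(query_feat, local)]
    return (new_r, new_c), new_query


def track(query_feat, video_feats, num_iters=2):
    """Run per-frame matching, then a few refinement iterations."""
    positions = [match_stage(query_feat, frame) for frame in video_feats]
    queries = [list(query_feat) for _ in video_feats]
    for _ in range(num_iters):
        for t, frame in enumerate(video_feats):
            positions[t], queries[t] = refine_stage(queries[t], frame, positions[t])
    return positions
```

Here `video_feats` is assumed to be a list of frames, each an H x W grid (nested lists) of feature vectors, and `track` returns one `(row, col)` estimate per frame in grid coordinates.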
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| DAVIS | TAPIR (Panning MOVi-E) | Average Jaccard | 61.3 | — | Unverified |
| DAVIS | TAPIR (MOVi-E) | Average Jaccard | 59.8 | — | Unverified |
| Kinetics | TAPIR (MOVi-E) | Average Jaccard | 57.1 | — | Unverified |
| Kinetics | TAPIR (Panning MOVi-E) | Average Jaccard | 57.2 | — | Unverified |
| Kubric | TAPIR (Panning MOVi-E) | Average Jaccard | 84.7 | — | Unverified |
| Kubric | TAPIR (MOVi-E) | Average Jaccard | 84.3 | — | Unverified |
| RGB-Stacking | TAPIR (MOVi-E) | Average Jaccard | 66.2 | — | Unverified |
| RGB-Stacking | TAPIR (Panning MOVi-E) | Average Jaccard | 62.7 | — | Unverified |
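The Average Jaccard values in the table combine position accuracy and occlusion prediction into one score. The sketch below is a simplified illustration, assuming the commonly described TAP-Vid definition: at each pixel threshold, a true positive is a point that is visible in the ground truth, predicted visible, and within the threshold; a false positive is a point predicted visible that is either occluded in the ground truth or outside the threshold; and the Jaccard score is true positives divided by ground-truth visible points plus false positives. The thresholds (1, 2, 4, 8, 16 pixels), the function name `average_jaccard`, and the flat per-point input layout are assumptions made for illustration. Details of the benchmark protocol such as evaluation resolution and per-video averaging are omitted.

```python
def average_jaccard(gt_points, gt_visible, pred_points, pred_visible,
                    thresholds=(1, 2, 4, 8, 16)):
    """Mean Jaccard score over several pixel-distance thresholds (value in [0, 1])."""
    scores = []
    for thr in thresholds:
        true_pos = false_pos = gt_pos = 0
        for gp, gv, pp, pv in zip(gt_points, gt_visible, pred_points, pred_visible):
            # Squared Euclidean distance compared against the squared threshold.
            within = (gp[0] - pp[0]) ** 2 + (gp[1] - pp[1]) ** 2 < thr ** 2
            if gv:
                gt_pos += 1
            if pv and gv and within:
                true_pos += 1
            if pv and (not gv or not within):
                false_pos += 1
        denom = gt_pos + false_pos
        # Returning 0.0 for an empty denominator is a choice made for this sketch.
        scores.append(true_pos / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

The function returns a fraction; multiplying by 100 puts it on the same scale as the table entries.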