A Straightforward Framework For Video Retrieval Using CLIP
Jesús Andrés Portillo-Quintero, José Carlos Ortiz-Bayliss, Hugo Terashima-Marín
Code
- github.com/Deferf/CLIP_Video_Representation (official, PyTorch)
Abstract
Video Retrieval is a challenging task where a text query is matched to a video or vice versa. Most of the existing approaches for addressing such a problem rely on annotations made by the users. Although simple, this approach is not always feasible in practice. In this work, we explore the application of the language-image model, CLIP, to obtain video representations without the need for said annotations. This model was explicitly trained to learn a common space where images and text can be compared. Using various techniques described in this document, we extended its application to videos, obtaining state-of-the-art results on the MSR-VTT and MSVD benchmarks.
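The simplest of the aggregation techniques the abstract alludes to is mean-pooling per-frame CLIP image embeddings into a single video vector and ranking videos by cosine similarity against the CLIP text embedding of the query. The sketch below illustrates that idea only; the function names are hypothetical, and the embeddings are assumed to come from CLIP's image and text encoders (not computed here).

```python
import numpy as np

def video_embedding(frame_embeddings: np.ndarray) -> np.ndarray:
    """Aggregate per-frame CLIP image embeddings (frames x dim) into one
    L2-normalized video vector via mean pooling (one possible strategy)."""
    v = frame_embeddings.mean(axis=0)
    return v / np.linalg.norm(v)

def rank_videos(text_emb: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Return video indices sorted by cosine similarity to the text query.
    `video_embs` is a (num_videos x dim) matrix of normalized video vectors."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    sims = video_embs @ text_emb  # cosine similarity, since rows are unit-norm
    return np.argsort(-sims)      # best match first
```

Because both encoders map into the same joint space, the same similarity scores support text-to-video and video-to-text retrieval; only the direction of the ranking changes.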
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| LSMDC | CLIP | text-to-video R@1 | 11.3 | — | Unverified |
| MSR-VTT | CLIP | text-to-video R@1 | 21.4 | — | Unverified |
| MSR-VTT-1kA | CLIP | text-to-video R@1 | 31.2 | — | Unverified |
| MSVD | CLIP | text-to-video R@1 | 37.0 | — | Unverified |