A Straightforward Framework For Video Retrieval Using CLIP
Jesús Andrés Portillo-Quintero, José Carlos Ortiz-Bayliss, Hugo Terashima-Marín
Code
- github.com/Deferf/CLIP_Video_Representation (official, PyTorch)
Abstract
Video Retrieval is a challenging task where a text query is matched to a video or vice versa. Most of the existing approaches for addressing such a problem rely on annotations made by the users. Although simple, this approach is not always feasible in practice. In this work, we explore the application of the language-image model, CLIP, to obtain video representations without the need for said annotations. This model was explicitly trained to learn a common space where images and text can be compared. Using various techniques described in this document, we extended its application to videos, obtaining state-of-the-art results on the MSR-VTT and MSVD benchmarks.
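The simplest of the aggregation techniques the abstract alludes to is mean-pooling per-frame CLIP image embeddings into a single video vector and ranking videos by cosine similarity against the CLIP text embedding of the query. The sketch below illustrates that idea only; the function names are hypothetical, and the embeddings are assumed to come from CLIP's image and text encoders (not computed here).

```python
import numpy as np

def video_embedding(frame_embeddings: np.ndarray) -> np.ndarray:
    """Aggregate per-frame CLIP image embeddings (frames x dim) into one
    L2-normalized video vector via mean pooling (one possible strategy)."""
    v = frame_embeddings.mean(axis=0)
    return v / np.linalg.norm(v)

def rank_videos(text_emb: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Return video indices sorted by cosine similarity to the text query.
    `video_embs` is a (num_videos x dim) matrix of normalized video vectors."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    sims = video_embs @ text_emb  # cosine similarity, since rows are unit-norm
    return np.argsort(-sims)      # best match first
```

Because both encoders map into the same joint space, the same similarity scores support text-to-video and video-to-text retrieval; only the direction of the ranking changes.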
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| LSMDC | CLIP | text-to-video R@1 | 11.3 | — | Unverified |
| MSR-VTT | CLIP | text-to-video R@1 | 21.4 | — | Unverified |
| MSR-VTT-1kA | CLIP | text-to-video R@1 | 31.2 | — | Unverified |
| MSVD | CLIP | text-to-video R@1 | 37.0 | — | Unverified |