Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval

2021-12-03

Fan Hu, Aozhu Chen, Ziyue Wang, Fangming Zhou, Jianfeng Dong, Xirong Li

Code

github.com/ruc-aimc-lab/laff
Official, in paper, PyTorch, ★ 46

Abstract

In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Unlike previous research that considers feature fusion at only one end, either video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing the convex combination of the features is preferable to modeling their correlations with computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both the video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) justify LAFF as a new baseline for text-to-video retrieval.
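To make the idea of a convex combination of features concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (see the linked repository for that). It assumes the features have already been projected to a common dimension, and it scores each feature with a single hypothetical vector `w`; the names `convex_fusion`, `softmax`, and `w` are illustrative only. A softmax over the scores yields non-negative weights that sum to one, so the fused vector is a convex combination of the inputs.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def convex_fusion(features, w):
    """Fuse k same-length feature vectors into one vector.

    features: list of k vectors, each of length d (assumed already
              projected to a common dimension d).
    w:        hypothetical scoring vector of length d; in a trained
              model this would be learned, here it is fixed by hand.
    Returns the fused vector and the per-feature weights.
    """
    # One scalar score per feature: the dot product with w.
    scores = [sum(wi * xi for wi, xi in zip(w, f)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    # Weighted sum of the features, coordinate by coordinate.
    fused = [sum(a * f[j] for a, f in zip(weights, features)) for j in range(dim)]
    return fused, weights


# Three toy "features" standing in for different off-the-shelf extractors.
features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0]]
w = [0.5, -0.2, 0.1]
fused, weights = convex_fusion(features, w)
print("weights:", weights)
print("fused:", fused)
```

Because the weights are explicit and sum to one, they can be inspected directly: a feature that consistently receives a small weight is a candidate for removal, which is one way to read the abstract's remark about using interpretability for feature selection.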

Tasks

Ad-hoc video search, Feature Selection, Retrieval, Text to Video Retrieval, Video Retrieval

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
TRECVID-AVS16 (IACC.3)	LAFF	infAP	0.22	—	Unverified
TRECVID-AVS17 (IACC.3)	LAFF	infAP	0.29	—	Unverified
TRECVID-AVS18 (IACC.3)	LAFF	infAP	0.15	—	Unverified
TRECVID-AVS19 (V3C1)	LAFF	infAP	0.19	—	Unverified
TRECVID-AVS20 (V3C1)	LAFF	infAP	0.27	—	Unverified