Saliency-Guided DETR for Moment Retrieval and Highlight Detection

2024-10-02

Aleksandr Gordeev, Vladimir Dokholyan, Irina Tolstykh, Maksim Kuprashevich

Code

github.com/ai-forever/sg-detr
Official implementation (in paper), ★ 15

Abstract

Existing approaches to video moment retrieval and highlight detection fail to align text and video features efficiently, which leads to unsatisfactory performance and limited production use. To address this, we propose a novel architecture that builds on recent foundational video models designed for such alignment. Combined with the proposed Saliency-Guided Cross Attention mechanism and a hybrid DETR architecture, our approach significantly improves performance on both moment retrieval and highlight detection. To improve results further, we developed InterVid-MR, a large-scale, high-quality dataset for pretraining. With this pretraining, our architecture achieves state-of-the-art results on the QVHighlights, Charades-STA and TACoS benchmarks. The proposed approach provides an efficient and scalable solution for both zero-shot and fine-tuning scenarios in video-language tasks.

Tasks

Highlight Detection, Moment Retrieval, Natural Language Moment Retrieval, Natural Language Queries, Retrieval, Temporal Action Localization, Zero-shot Moment Retrieval

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
TvSum	SG-DETR	mAP	87.1	—	Unverified
YouTube Highlights	SG-DETR (w/ PT)	mAP	78	—	Unverified
YouTube Highlights	SG-DETR	mAP	76.7	—	Unverified
