Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval
Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, Amit K. Roy-Chowdhury
Code: github.com/niluthpol/multimodal_vtt (PyTorch)
Abstract
Constructing a joint representation that is invariant across different modalities (e.g., video, language) is of significant importance in many multimedia applications. While there have been a number of recent successes in developing effective image-text retrieval methods by learning joint representations, the video-text retrieval task, in contrast, has not been explored to its fullest extent. In this paper, we study how to effectively utilize the multimodal cues available in videos for the cross-modal video-text retrieval task. Based on our analysis, we propose a novel framework that simultaneously utilizes multimodal features (different visual characteristics, audio inputs, and text) through a fusion strategy for efficient retrieval. Furthermore, we explore several loss functions for training the joint embedding and propose a modified pairwise ranking loss for the retrieval task. Experiments on the MSVD and MSR-VTT datasets demonstrate that our method achieves significant performance gains over state-of-the-art approaches.
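To make the pairwise ranking objective concrete, the sketch below shows one common form of a max-margin ranking loss over video-text similarities in a joint embedding space, written in PyTorch to match the linked code. It is an illustration only, not the authors' exact formulation: the class name, the margin value, and the emphasis on the hardest negative per query (a frequent modification of the basic ranking loss) are all assumptions.

```python
import torch
import torch.nn as nn


class PairwiseRankingLoss(nn.Module):
    """Max-margin pairwise ranking loss for a joint video-text embedding.

    Illustrative sketch under assumed conventions: embeddings are
    L2-normalized so a dot product gives cosine similarity, and the
    hardest-negative variant is one possible "modified" ranking loss.
    """

    def __init__(self, margin=0.2, hard_negative=True):
        super().__init__()
        self.margin = margin
        self.hard_negative = hard_negative

    def forward(self, video_emb, text_emb):
        # video_emb, text_emb: (batch, dim), assumed L2-normalized.
        scores = video_emb @ text_emb.t()        # (batch, batch) similarity matrix
        diagonal = scores.diag().view(-1, 1)     # similarities of matched pairs

        # Margin violations in both retrieval directions.
        cost_text = (self.margin + scores - diagonal).clamp(min=0)       # video -> wrong text
        cost_video = (self.margin + scores - diagonal.t()).clamp(min=0)  # text -> wrong video

        # Exclude the matched (diagonal) pairs from the penalty.
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        cost_text = cost_text.masked_fill(mask, 0)
        cost_video = cost_video.masked_fill(mask, 0)

        if self.hard_negative:
            # Keep only the hardest negative for each query.
            cost_text = cost_text.max(dim=1)[0]
            cost_video = cost_video.max(dim=0)[0]

        return cost_text.sum() + cost_video.sum()
```

In use, `video_emb` would come from the fused video representation (visual and audio cues) and `text_emb` from the sentence encoder; summing only over the hardest negatives, rather than all negatives, is what typically distinguishes the modified loss from the standard pairwise ranking loss.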