Multilingual Multimodal Pretraining for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

2020-12-07

Anonymous

Abstract

This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextualized multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when the multilingual text-video model is queried with non-English sentences. To address this problem, we introduce a multilingual multimodal pretraining strategy and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pretraining. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K. Our model and the Multi-HowTo100M dataset will be made available.
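The abstract does not spell out the training objective, but text-to-video search models of this kind typically align text and video embeddings in a shared space and rank videos by similarity to the query. Below is a minimal PyTorch sketch of that general idea; the encoder dimensions, projection layers, and symmetric InfoNCE-style loss are illustrative assumptions, not the paper's exact method.

```python
# Minimal sketch of contrastive text-video alignment (illustrative; not the
# paper's exact architecture). Assumes pre-extracted, pooled features for
# both modalities; all dimensions are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextVideoAligner(nn.Module):
    def __init__(self, text_dim=768, video_dim=1024, embed_dim=512):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.0))  # learnable temperature

    def forward(self, text_feats, video_feats):
        # text_feats:  (batch, text_dim)  pooled multilingual sentence features
        # video_feats: (batch, video_dim) pooled video clip features
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        # Similarity matrix: every caption scored against every video in the batch.
        return self.logit_scale.exp() * t @ v.t()

def symmetric_contrastive_loss(logits):
    # Matched text-video pairs lie on the diagonal; contrast in both directions.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for real encoder outputs.
model = TextVideoAligner()
text = torch.randn(8, 768)    # e.g., multilingual Transformer sentence vectors
video = torch.randn(8, 1024)  # e.g., pooled video clip features
loss = symmetric_contrastive_loss(model(text, video))
loss.backward()
```

At search time, ranking videos by the same cosine similarity yields the retrieval scores, regardless of whether the query sentence is in English or another language.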
