Generalizing Multimodal Pre-training into Multilingual via Language Acquisition

2022-05-29

Liang Zhang, Anwen Hu, Qin Jin

Abstract

English-based Vision-Language Pre-training (VLP) has achieved great success in various downstream tasks. Some efforts have been made to generalize this success to non-English languages through Multilingual Vision-Language Pre-training (M-VLP). However, due to the large number of languages, M-VLP models often require huge computing resources and cannot be flexibly extended to new languages. In this work, we propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into a multilingual one. Specifically, we design a lightweight language acquisition encoder based on state-of-the-art monolingual VLP models. We further propose a two-stage training strategy to optimize the language acquisition encoder, consisting of the Native Language Transfer stage and the Language Exposure stage. With much less multilingual training data and computing resources, our model achieves state-of-the-art performance on multilingual image-text and video-text retrieval benchmarks.

Tasks

Language Acquisition, Retrieval, Text Retrieval, Video-Text Retrieval
