A Text-Image Pair Is Not Enough: Language-Vision Relation Inference with Auxiliary Modality Translation
Anonymous
Abstract
The semantic relations between the language and vision modalities have become increasingly important, as they can effectively facilitate downstream multi-modal tasks such as cross-modal retrieval, multi-modal sentiment analysis, and entity recognition. Although several approaches have been proposed for language-vision relation inference (LVRI), they typically rely on the limited information in a single posted sentence and a single image. In this paper, to broaden the information available from the original input, we introduce the concept of modality translation with two possible directions for generating additional modalities, and propose the Auxiliary Modality Translation (AMT) framework for LVRI. AMT can generate not only an additional image by translating the original text, but also an additional text by translating the original image. To handle the resulting three or four input modalities, we employ a unified layer-wise transformer structure to perform multi-modal interaction. Systematic experiments and extensive analysis demonstrate that our approach with auxiliary modality translation significantly outperforms conventional LVRI approaches as well as several competitive baselines for other text-image classification tasks.
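To make the abstract's pipeline concrete, the following is a minimal PyTorch sketch of the general idea: translate the original text and image into auxiliary modalities, encode all three or four inputs, and fuse them with a shared layer-wise transformer before classifying the relation. Every module, dimension, and name here (e.g., `AMTSketch`, the linear stand-in encoders, the number of classes) is an illustrative assumption, not the authors' implementation.

```python
# Hedged sketch of auxiliary modality translation + layer-wise fusion.
# Real systems would use an off-the-shelf text-to-image model and an image
# captioner to produce the auxiliary modalities; here their outputs are
# assumed to already be feature sequences.
import torch
import torch.nn as nn


class AMTSketch(nn.Module):
    """Fuse original and translated modalities with a shared transformer."""

    def __init__(self, dim: int = 256, num_classes: int = 3):
        super().__init__()
        # Stand-ins for real encoders (e.g., a BERT-style text encoder and a
        # CNN/ViT region encoder); feature sizes below are assumptions.
        self.text_enc = nn.Linear(300, dim)    # assumed 300-d token features
        self.image_enc = nn.Linear(2048, dim)  # assumed 2048-d region features
        # Unified layer-wise transformer over the concatenated modalities.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text, image, aux_image=None, aux_text=None):
        # Encode original modalities plus whichever translated ones exist.
        tokens = [self.text_enc(text), self.image_enc(image)]
        if aux_image is not None:          # image generated from the text
            tokens.append(self.image_enc(aux_image))
        if aux_text is not None:           # caption generated from the image
            tokens.append(self.text_enc(aux_text))
        fused = self.fusion(torch.cat(tokens, dim=1))  # (B, total_len, dim)
        return self.classifier(fused.mean(dim=1))      # pooled relation logits


# Toy usage with random features standing in for real encoder outputs.
model = AMTSketch()
text = torch.randn(2, 16, 300)      # 16 token features per example
image = torch.randn(2, 36, 2048)    # 36 region features per example
aux_text = torch.randn(2, 12, 300)  # features of a generated caption
logits = model(text, image, aux_text=aux_text)
print(logits.shape)  # torch.Size([2, 3])
```

The design choice illustrated is only the one stated in the abstract: the auxiliary modalities are optional inputs, so the same fusion stack handles two, three, or four modalities without architectural changes.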