Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift

2022-12-15

JieLin Qiu, Yi Zhu, Xingjian Shi, Florian Wenzel, Zhiqiang Tang, Ding Zhao, Bo Li, Mu Li

Code

github.com/jielin-qiu/mm_robustness
PyTorch · ★ 38

Abstract

Multimodal image-text models have shown remarkable performance in the past few years. However, evaluating robustness against distribution shifts is crucial before adopting them in real-world applications. In this work, we investigate the robustness of 12 popular open-source image-text models under common perturbations on five tasks (image-text retrieval, visual reasoning, visual entailment, image captioning, and text-to-image generation). In particular, we propose several new multimodal robustness benchmarks by applying 17 image perturbation and 16 text perturbation techniques on top of existing datasets. We observe that multimodal models are not robust to image and text perturbations, especially to image perturbations. Among the tested perturbation methods, character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data. We also introduce two new robustness metrics (MMI for MultiModal Impact score and MOR for Missing Object Rate) for proper evaluation of multimodal models. We hope our extensive study sheds light on new directions for the development of robust multimodal models. More details can be found on the project webpage: https://MMRobustness.github.io.
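
To make the idea of a character-level text perturbation concrete, here is a minimal Python sketch. It is not taken from the paper or its repository: the function names `swap_adjacent_chars` and `relative_drop`, the `rate` and `seed` parameters, and the example caption are all hypothetical. The relative drop shown is a generic way to compare clean and perturbed scores and should not be read as the paper's MMI definition, which the abstract does not spell out.

```python
import random


def swap_adjacent_chars(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap neighbouring letters with probability `rate` per position.

    Hypothetical illustration of one character-level perturbation;
    the benchmark's own perturbations may be implemented differently.
    """
    rng = random.Random(seed)  # local RNG keeps the result reproducible
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each letter moves at most once
        else:
            i += 1
    return "".join(chars)


def relative_drop(clean_score: float, perturbed_score: float) -> float:
    """Relative performance drop between clean and perturbed inputs.

    Generic illustration only; not the paper's MMI formula.
    """
    if clean_score == 0:
        raise ValueError("clean_score must be non-zero")
    return (clean_score - perturbed_score) / clean_score


# Hypothetical caption used only to demonstrate the perturbation.
caption = "a man riding a wave on top of a surfboard"
perturbed = swap_adjacent_chars(caption, rate=0.3, seed=0)
print(perturbed)
```

Because swaps happen only between two letters, word boundaries and the overall length are preserved, so the perturbed caption stays aligned with the original at the word level. A model would then be scored on both `caption` and `perturbed`, and the two scores passed to `relative_drop`.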

Tasks

Benchmarking, Image Captioning, Image Generation, Image-text Retrieval, Retrieval, Text Retrieval, Text-to-Image Generation, Visual Entailment, Visual Reasoning
