
MANGO: Enhancing the Robustness of VQA Models via Adversarial Noise Generation

2022-01-16 · ACL ARR January 2022

Anonymous


Abstract

Large-scale pre-trained vision-and-language (V+L) transformers have propelled the state of the art (SOTA) on the Visual Question Answering (VQA) task. Despite impressive performance on the standard VQA benchmark, it remains unclear how robust these models are. To investigate, we conduct a host of evaluations over 4 different types of robust VQA datasets: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv) Answer Distribution Shift. Experiments show that, with standard finetuning alone, pre-trained V+L models already exhibit better robustness than many task-specific SOTA methods. To further enhance model robustness, we propose MANGO, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool V+L models. Unlike previous studies that focus on one specific type of robustness, MANGO is agnostic to robustness type, and delivers a universal performance lift for both task-specific and pre-trained models across diverse robust VQA datasets designed to evaluate broad aspects of robustness. Comprehensive experiments demonstrate that MANGO achieves new SOTA on 7 out of 9 robustness benchmarks.
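The core idea of adversarial training in the embedding space can be illustrated with a toy sketch. Note this is not the paper's actual method: MANGO learns a multimodal noise generator network, whereas the snippet below uses a simple FGSM-style gradient-sign perturbation on the input embedding of a linear classifier. All names (`ce_loss`, `adversarial_embedding`, the toy model `W`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over logits
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(x, W, y):
    # cross-entropy loss of a toy linear "VQA model" (logits = W @ x)
    return -np.log(softmax(W @ x)[y])

def grad_wrt_embedding(x, W, y):
    # gradient of the cross-entropy loss w.r.t. the input embedding x
    p = softmax(W @ x)
    p[y] -= 1.0
    return W.T @ p

def adversarial_embedding(x, W, y, eps=0.1):
    # Perturb the embedding in the direction that increases the loss,
    # constrained to an L_inf ball of radius eps (FGSM-style noise).
    # MANGO instead trains a generator to produce such noise; this is
    # only the simplest stand-in for "adversarial noise in embedding space".
    g = grad_wrt_embedding(x, W, y)
    return x + eps * np.sign(g)

# Usage: the perturbed embedding should incur a higher (or equal) loss,
# i.e. it "fools" the toy model more than the clean embedding does.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))   # 5 answer classes, 8-dim embedding
x = rng.normal(size=8)        # clean multimodal embedding (hypothetical)
y = 2                         # ground-truth answer index
x_adv = adversarial_embedding(x, W, y, eps=0.1)
```

In adversarial training, the model would then be finetuned on both `x` and `x_adv`, which is the mechanism by which embedding-space noise improves robustness without requiring type-specific augmented data.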
