Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness
Zeyu Wang, Cihang Xie, Brian Bartoldson, Bhavya Kailkhura
Code: https://github.com/zw615/double_visual_defense (official)
Abstract
This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel "double visual defense" to enhance this robustness. Unlike previous approaches that resort to lightweight adversarial fine-tuning of a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch using web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from each stage, ΔCLIP and Δ²LLaVA, show substantially enhanced zero-shot robustness and set a new state of the art in adversarial defense for vision-language models. For example, the adversarial robustness of ΔCLIP surpasses that of the previous best models on ImageNet-1k by ~20%. Similarly, compared to prior art, Δ²LLaVA brings a ~30% robustness improvement to the image captioning task and a ~20% robustness improvement to the visual question answering task. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines. Our project page is https://doublevisualdefense.github.io/.
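The abstract does not spell out the training procedure, but the core idea of adversarial contrastive pre-training can be illustrated generically. The sketch below runs a PGD inner loop on the visual input to maximize the CLIP contrastive loss, then updates the model on the perturbed batch. It is a minimal sketch under stated assumptions, not the authors' implementation: the `model` interface (returning image and text features plus a learnable `logit_scale`), the `clip_contrastive_loss` helper, and the hyperparameters `eps`, `alpha`, and `steps` are all illustrative.

```python
# Minimal sketch of one adversarially trained CLIP-style step (assumption:
# not the paper's code). Assumes pixel values in [0, 1] and a `model` that
# maps (images, texts) -> (image_features, text_features).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_feat, txt_feat, logit_scale):
    # Symmetric InfoNCE loss over the batch, as in standard CLIP training.
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = logit_scale * img_feat @ txt_feat.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def pgd_adversarial_step(model, images, texts, optimizer,
                         eps=4/255, alpha=1/255, steps=3):
    # Inner maximization: craft an L_inf-bounded visual perturbation.
    delta = torch.zeros_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        adv = (images + delta).clamp(0, 1)
        img_feat, txt_feat = model(adv, texts)
        loss = clip_contrastive_loss(img_feat, txt_feat,
                                     model.logit_scale.exp())
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    # Outer minimization: update model weights on the adversarial batch.
    optimizer.zero_grad()
    adv = (images + delta.detach()).clamp(0, 1)
    img_feat, txt_feat = model(adv, texts)
    loss = clip_contrastive_loss(img_feat, txt_feat, model.logit_scale.exp())
    loss.backward()
    optimizer.step()
    return loss.item()
```

The second stage described in the abstract, adversarial visual instruction tuning, would follow the same min-max pattern but with the loss replaced by the instruction-following objective of the LLaVA-style model.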