
LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models

2026-03-10

Guolei Huang, Qinzhi Peng, Gan Xu, Yao Huang, Yuxuan Lu, Yongjun Shen


Abstract

As Vision-Language Models (VLMs) move into interactive, multi-turn use, safety concerns intensify for multimodal multi-turn dialogue, which is characterized by concealment of malicious intent, contextual risk accumulation, and cross-modal joint risk. These characteristics limit the effectiveness of content moderation approaches designed for single-turn or single-modality settings. To address these limitations, we first construct the Multimodal Multi-turn Dialogue Safety (MMDS) dataset, comprising 4,484 annotated dialogues and a comprehensive risk taxonomy with 8 primary dimensions and 60 sub-dimensions. As part of MMDS construction, we introduce Multimodal Multi-turn Red Teaming (MMRT), an automated framework for generating unsafe multimodal multi-turn dialogues. We further propose LLaVAShield, which audits the safety of both user inputs and assistant responses under specified policy dimensions in multimodal multi-turn dialogues. Extensive experiments show that LLaVAShield significantly outperforms state-of-the-art VLMs and existing content moderation tools while demonstrating strong generalization and flexible policy adaptation. Additionally, we analyze the vulnerabilities of mainstream VLMs to harmful inputs and evaluate the contribution of key components, advancing understanding of safety mechanisms in multimodal multi-turn dialogues.
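The abstract describes LLaVAShield as auditing both user inputs and assistant responses in a multimodal multi-turn dialogue against specified policy dimensions. As an illustration only, the Python sketch below shows what such a per-turn audit interface might look like; the names (`Turn`, `ModerationVerdict`, `audit_dialogue`) and all field choices are assumptions for exposition, not the paper's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical MMDS-style dialogue record: each turn carries text and may
# attach an image, since risks can be cross-modal.
@dataclass
class Turn:
    role: str                         # "user" or "assistant"
    text: str
    image_path: Optional[str] = None  # optional image for this turn

# Hypothetical per-turn moderation result under a given policy: the audit
# targets both user inputs and assistant responses, as the paper describes.
@dataclass
class ModerationVerdict:
    target: str                       # "user_input" or "assistant_response"
    turn_index: int
    unsafe: bool
    categories: List[str] = field(default_factory=list)  # triggered risk sub-dimensions

def audit_dialogue(turns: List[Turn], policy: List[str]) -> List[ModerationVerdict]:
    """Placeholder for a LLaVAShield-style audit loop: score every user input
    and assistant response against the supplied policy dimensions. A real
    moderator would run the VLM over the full multimodal context up to each
    turn; this stub only emits "safe" verdicts to show the interface shape."""
    verdicts = []
    for i, turn in enumerate(turns):
        target = "user_input" if turn.role == "user" else "assistant_response"
        verdicts.append(ModerationVerdict(target=target, turn_index=i,
                                          unsafe=False, categories=[]))
    return verdicts
```

A context-cumulative interface like this reflects the multi-turn setting: because risk can accumulate across turns, a verdict on turn i must depend on turns 0..i rather than on the current message alone.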
