Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks

2025-01-18Code Available0· sign in to hype

Xin Yi, Yue Li, Dongsheng Shi, LinLin Wang, Xiaoling Wang, Liang He

Code Available — Be the first to reproduce this paper.

Code

github.com/xinykou/Against_Jailbreak
Officialjax★ 5

Abstract

Ensuring safety alignment is a critical requirement for large language models (LLMs), particularly given increasing deployment in real-world applications. Despite considerable advancements, LLMs remain susceptible to jailbreak attacks, which exploit system vulnerabilities to circumvent safety measures and elicit harmful or inappropriate outputs. Furthermore, while adversarial training-based defense methods have shown promise, a prevalent issue is the unintended over-defense behavior, wherein models excessively reject benign queries, significantly undermining their practical utility. To address these limitations, we introduce LATPC, a Latent-space Adversarial Training with Post-aware Calibration framework. LATPC dynamically identifies safety-critical latent dimensions by contrasting harmful and benign inputs, enabling the adaptive construction of targeted refusal feature removal attacks. This mechanism allows adversarial training to concentrate on real-world jailbreak tactics that disguise harmful queries as benign ones. During inference, LATPC employs an efficient embedding-level calibration mechanism to minimize over-defense behaviors with negligible computational overhead. Experimental results across five types of disguise-based jailbreak attacks demonstrate that LATPC achieves a superior balance between safety and utility compared to existing defense frameworks. Further analysis demonstrates the effectiveness of leveraging safety-critical dimensions in developing robust defense methods against jailbreak attacks.

Tasks

Safety Alignment

Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks

Code

Abstract

Tasks

Reproductions