Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT
Le Yu, Zhengyue Zhao, Yawen Zheng, Yunhao Liu
Abstract
Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we show that the safety alignment of RVLMs can be easily broken by a novel attack we term Stealth Fine-Tuning. Our method elicits harmful reasoning traces through segment-level interference and reuses the model's self-generated outputs as supervised fine-tuning data. To limit distribution shift during tuning, we introduce a turn-based weighted loss. With only 499 samples and under 3 hours of QLoRA training on a single A100, Stealth Fine-Tuning outperforms IDEATOR by 38.66% in attack success rate (ASR) while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses.

Disclaimer: This paper contains content that may be disturbing or offensive.
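The abstract mentions a turn-based weighted loss but does not specify its form. The sketch below shows one plausible reading: a standard causal-LM cross-entropy in which each conversation turn (e.g., prompt, self-generated CoT, final answer) receives its own loss weight, so that down-weighting some turns reduces distribution shift. The function name, the turn indexing, and the example weights are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a turn-based weighted SFT loss (assumed form, not the
# paper's code). Per-turn weights and turn indices are illustrative only.
import torch
import torch.nn.functional as F


def turn_weighted_loss(logits, labels, turn_ids, turn_weights):
    """Per-token cross-entropy scaled by a weight assigned to each turn.

    logits:       (batch, seq_len, vocab) model outputs
    labels:       (batch, seq_len) target token ids, -100 at ignored positions
    turn_ids:     (batch, seq_len) index of the turn each token belongs to
                  (e.g., 0 = prompt, 1 = self-generated CoT, 2 = final answer)
    turn_weights: dict mapping turn index -> loss weight
    """
    # Shift so each position predicts the next token, as in causal LM training.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    turn_ids = turn_ids[:, 1:]

    # Unreduced token-level loss; positions labeled -100 contribute zero.
    token_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
        reduction="none",
    ).view(labels.shape)

    # Expand the per-turn weights into a per-token weight tensor.
    weights = torch.zeros_like(token_loss)
    for turn, w in turn_weights.items():
        weights = torch.where(
            turn_ids == turn, torch.full_like(weights, w), weights
        )

    # Weighted mean over the supervised (non-ignored) tokens.
    mask = (labels != -100).float()
    return (token_loss * weights * mask).sum() / mask.sum().clamp(min=1)
```

Under this reading, a setting such as `turn_weights={0: 0.0, 1: 0.3, 2: 1.0}` would exclude prompt tokens, down-weight the self-generated CoT turn, and fully supervise the final answer; the actual weighting used in the paper would need to be taken from its method section.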