Risk of Text Backdoor Attacks Under Dataset Distillation
Kejun Zhang, Yutuo Song, Shaofei Xu, Pengcheng Li, Rong Qian, Pengzhi Han, Lingyun Xu
Abstract
Dataset distillation aims to transfer knowledge from large training datasets into much smaller synthetic datasets, enabling rapid training of neural networks while preserving the performance obtained on the original data. However, current research on dataset distillation primarily focuses on balancing resource utilization and model capability, with limited discussion of the associated security risks, especially in the natural language processing (NLP) domain. In this paper, we focus on backdoor attacks against data distilled from text datasets. Specifically, we inject triggers into the synthetic dataset during the distillation process, rather than during the model training phase. We propose a framework for backdoor attacks in the context of text dataset distillation, termed Text Backdoor Attack under Dataset Distillation (TBADD). This framework is broadly applicable to backdoor attack methods based on dataset-poisoning principles. It achieves a favorable trade-off between clean-sample accuracy (CACC) and attack success rate (ASR) by separating clean and poisoned samples in the validation set and evaluating the distilled dataset's performance through a weighted assessment. Experimental comparisons using four popular backdoor attacks on two text classification tasks demonstrate that TBADD achieves attack success rates comparable to those of models trained on the original poisoned dataset, without significantly compromising performance on the original task. Under two visible backdoor attacks, the ASR approaches 100%, while under two invisible backdoor attacks, the average ASR still reaches 83%, demonstrating effective attack outcomes. Our code is available at https://github.com/Songsci1024/TBADD.
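The weighted assessment described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weight `alpha`, the function name, and the candidate-selection loop are assumptions introduced here to show how a distilled dataset might be scored by combining clean-sample accuracy with attack success rate measured on separated clean and poisoned validation splits.

```python
# Hedged sketch of a weighted evaluation of distilled-dataset candidates.
# CACC is measured on the clean validation split, ASR on the poisoned split;
# alpha (an illustrative choice, not from the paper) trades one off against
# the other when ranking candidates.

def weighted_score(cacc: float, asr: float, alpha: float = 0.5) -> float:
    """Combine clean-sample accuracy and attack success rate into one score."""
    assert 0.0 <= cacc <= 1.0 and 0.0 <= asr <= 1.0
    return alpha * cacc + (1.0 - alpha) * asr


def select_best(candidates: list[tuple[float, float]], alpha: float = 0.5) -> int:
    """Return the index of the candidate distilled dataset with the best
    weighted score, given (CACC, ASR) pairs measured on the split validation set."""
    scores = [weighted_score(cacc, asr, alpha) for cacc, asr in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Example: three hypothetical candidates measured during distillation.
best = select_best([(0.92, 0.60), (0.90, 0.83), (0.70, 0.99)])
```

With equal weighting, the second candidate wins here: it sacrifices little clean accuracy while keeping the attack highly effective, which mirrors the CACC/ASR balance the framework targets.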