Intriguing class-wise properties of adversarial training

2021-01-01

Qi Tian, Kun Kuang, Fei Wu, Yisen Wang

Abstract

Adversarial training is one of the most effective approaches to improving model robustness against adversarial examples. However, previous works mainly focus on the overall robustness of the model, and an in-depth analysis of the role of each class involved in adversarial training is still missing. In this paper, we provide the first detailed class-wise diagnosis of adversarial training on six widely used datasets, i.e., MNIST, CIFAR-10, CIFAR-100, SVHN, STL-10 and ImageNet. Surprisingly, we find remarkable robustness discrepancies among classes, demonstrating the following intriguing properties: 1) Many examples from a certain class can only be maliciously attacked into a few specific semantically similar classes, and these examples have no adversarial counterparts within the bounded ε-ball if we re-train the model without those specific classes; 2) The robustness of each class is positively correlated with the norm of its classifier weight in deep neural networks; 3) Stronger attacks are usually more powerful against vulnerable classes, and we empirically propose a simple but effective attack to further verify that these vulnerable classes are the major hidden dangers of the robust model. We believe these findings can contribute to a more comprehensive understanding of adversarial training as well as to further improvements in adversarial robustness.
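As a rough illustration of property 2, the class-wise diagnostic can be sketched as follows: take the final linear layer's weight matrix, compute the ℓ2 norm of each class's weight row, and correlate those norms with per-class robust accuracy. This is a hypothetical NumPy sketch, not the paper's code; the weight matrix and robust accuracies here are stand-in random data, whereas the paper measures them on actual adversarially trained models.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, feat_dim = 10, 512

# Stand-in final-layer weights W of shape (num_classes, feat_dim);
# in practice these would come from a trained robust classifier.
W = rng.normal(size=(num_classes, feat_dim))

# Per-class weight norm ||w_c||_2, one value per class row.
class_norms = np.linalg.norm(W, axis=1)

# Stand-in per-class robust accuracies (e.g., accuracy under a PGD attack,
# measured separately for each class on a real model).
robust_acc = rng.uniform(0.2, 0.6, size=num_classes)

# Property 2 predicts a positive Pearson correlation between the two
# quantities on a real adversarially trained model; with random stand-in
# data the correlation is of course uninformative.
corr = float(np.corrcoef(class_norms, robust_acc)[0, 1])
print(class_norms.shape, corr)
```

On a real model, a markedly low weight norm for a class would flag it as a candidate vulnerable class for the stronger targeted attacks the paper discusses.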
