SOTAVerified

Provably robust classification of adversarial examples with detection

ICLR 2021 · 2021-01-01 · Code Available

Fatemeh Sheikholeslami, Ali Lotfi, J Zico Kolter


Abstract

Adversarial attacks against deep networks can be defended against either by building robust classifiers or by creating classifiers that can detect the presence of adversarial perturbations. Although it may intuitively seem easier to simply detect attacks rather than build a robust classifier, this has not been borne out in practice even empirically, as most detection methods have subsequently been broken by adaptive attacks, thus necessitating verifiable performance for detection mechanisms. In this paper, we propose a new method for jointly training a provably robust classifier and detector. Specifically, we show that by introducing an additional "abstain/detection" class into a classifier, we can modify existing certified defense mechanisms to allow the classifier to either robustly classify or detect adversarial attacks. We extend the common interval bound propagation (IBP) method for certified robustness under ℓ∞ perturbations to account for our new robust objective, and show that the method outperforms traditional IBP used in isolation, especially for large perturbation sizes. Specifically, tests on the MNIST and CIFAR-10 datasets exhibit promising results, for example with provable robust error less than 63.63% and 67.92%, for 55.6% and 66.37% natural error, for ε=8/255 and 16/255 on the CIFAR-10 dataset, respectively.
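The core idea can be illustrated with a small sketch: propagate an ℓ∞ input interval through a layer with standard IBP, then check a sufficient condition that the worst-case prediction over the whole interval is either the true class or the abstain class. This is a minimal illustration of the certify-or-detect criterion, not the authors' implementation; the toy weights, the single linear layer, and the function names are assumptions for the example.

```python
import numpy as np

def ibp_linear(lb, ub, W, b):
    """Standard IBP through an affine layer y = W x + b.

    Positive weights pick up the lower/upper input bound directly;
    negative weights swap them, so each output bound is exact for
    an axis-aligned input box.
    """
    W_pos, W_neg = np.clip(W, 0, None), np.clip(W, None, 0)
    new_lb = W_pos @ lb + W_neg @ ub + b
    new_ub = W_pos @ ub + W_neg @ lb + b
    return new_lb, new_ub

def certify_or_detect(logit_lb, logit_ub, y, abstain):
    """Sufficient check for the paper's joint objective (sketch):
    over the entire input interval, every wrong class's logit upper
    bound stays below the best lower bound of the true class or the
    abstain class, so the worst-case output is correct-or-detected.
    """
    best_safe_lb = max(logit_lb[y], logit_lb[abstain])
    wrong = [j for j in range(len(logit_lb)) if j not in (y, abstain)]
    return all(logit_ub[j] < best_safe_lb for j in wrong)

# Toy example: 2 real classes plus an abstain logit (hypothetical weights).
x = np.array([1.0, -0.5])
eps = 0.1                                   # ℓ∞ perturbation radius
W = np.array([[2.0, 0.0],                   # class 0
              [0.0, 2.0],                   # class 1
              [1.0, 1.0]])                  # abstain/detection class
b = np.zeros(3)
lb, ub = ibp_linear(x - eps, x + eps, W, b)
print(certify_or_detect(lb, ub, y=0, abstain=2))  # True: certified
```

Training would then minimize a worst-case loss that treats either the correct label or the abstain output as acceptable, which is how the paper extends the usual IBP objective.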
