
Learning to Ignore Adversarial Attacks

2021-11-16 · ACL ARR November 2021

Anonymous


Abstract

Despite the strong performance of current NLP models, they can be brittle against adversarial attacks. To enable effective learning against adversarial inputs, we introduce the use of rationale models that can explicitly learn to ignore attack tokens. We find that the rationale models can ignore over 90% of attack tokens. This approach leads to consistent sizable improvements (~8%) over baseline models in robustness, for both BERT and RoBERTa, on MultiRC and FEVER, and also reliably outperforms data augmentation with adversarial examples alone. In many cases, we find that our method is able to close the gap between model performance on a clean test set and an attacked test set, eliminating the effect of adversarial attacks.
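To make the idea concrete, a rationale model typically follows a select-then-predict pipeline: an extractor masks out tokens it deems irrelevant (ideally including attack tokens), and a predictor classifies using only the kept tokens. The following is a minimal toy sketch of that pipeline; the function names, the threshold-based scoring, and the word-count predictor are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of a select-then-predict rationale pipeline.
# The scoring heuristic and predictor below are toy stand-ins for the
# learned extractor and classifier described in the abstract.

def extract_rationale(tokens, score_fn, threshold=0.5):
    """Keep only tokens whose relevance score clears the threshold;
    attack tokens should receive low scores and be masked out."""
    mask = [1 if score_fn(t) >= threshold else 0 for t in tokens]
    kept = [t for t, m in zip(tokens, mask) if m]
    return kept, mask

def predict(kept_tokens, evidence_words):
    """Toy predictor: classify by checking for task-relevant words
    among the tokens the rationale kept."""
    hits = sum(t in evidence_words for t in kept_tokens)
    return "supported" if hits > 0 else "refuted"

# Toy example: an injected attack token ("zxqv") gets a low relevance
# score, so the extractor drops it before prediction.
scores = {"the": 0.6, "claim": 0.9, "is": 0.6, "true": 0.95, "zxqv": 0.05}
tokens = ["the", "claim", "is", "true", "zxqv"]
kept, mask = extract_rationale(tokens, lambda t: scores.get(t, 0.0))
print(kept)                      # attack token removed
print(predict(kept, {"true"}))
```

In the paper's setting the extractor is trained so that attack tokens fall outside the rationale, which is why the predictor's accuracy on attacked inputs can approach clean-test accuracy.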
