SOTAVerified

On Meaning-Preserving Adversarial Perturbations for Sequence-to-Sequence Models

2019-05-01 · ICLR 2019

Paul Michel, Graham Neubig, Xian Li, Juan Miguel Pino


Abstract

Adversarial examples have been shown to be an effective way of assessing the robustness of neural sequence-to-sequence (seq2seq) models, by applying perturbations to a model's input that lead to large degradations in performance. However, these perturbations are only indicative of a weakness in the model if they do not change the semantics of the input in a way that would change the expected output. Using the example of machine translation (MT), we propose a new evaluation framework for adversarial attacks on seq2seq models that takes meaning preservation into account, and we demonstrate that existing methods may not preserve meaning in general. Based on these findings, we propose new constraints for attacks on word-based MT systems and show, via human and automatic evaluation, that they produce more semantically similar adversarial inputs. Furthermore, we show that adversarial training with meaning-preserving attacks improves the model's adversarial robustness without hurting test performance.
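The criterion described in the abstract can be sketched in code. This is one plausible formalization, not the paper's exact definitions: an attack only counts as a success if the relative degradation of the model's output exceeds the amount of meaning lost on the source side. The `unigram_f1` similarity below is a toy stand-in for whatever source- and target-side similarity metrics an evaluator would actually use.

```python
def unigram_f1(a: str, b: str) -> float:
    """Toy similarity: unigram F1 overlap (stand-in for a real semantic metric)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    common = sum(min(ta.count(w), tb.count(w)) for w in set(ta))
    if common == 0:
        return 0.0
    p, r = common / len(tb), common / len(ta)
    return 2 * p * r / (p + r)


def attack_succeeds(src: str, adv_src: str,
                    ref: str, hyp: str, adv_hyp: str) -> bool:
    """Hypothetical success test for a meaning-preservation-aware attack.

    src / adv_src : original and perturbed source sentences
    ref           : reference translation
    hyp / adv_hyp : model output on the original and perturbed source
    """
    s_src = unigram_f1(src, adv_src)   # how much source meaning is preserved
    s_tgt = unigram_f1(ref, hyp)       # baseline output quality
    s_adv = unigram_f1(ref, adv_hyp)   # output quality under attack

    d_src = 1.0 - s_src                                    # source-side meaning loss
    d_tgt = max(s_tgt - s_adv, 0.0) / max(s_tgt, 1e-9)     # relative target degradation

    # Success only if the attack destroys more meaning in the output
    # than it changes in the input.
    return d_tgt > d_src
```

Under this reading, an attack that leaves the source untouched but wrecks the output succeeds, while one that simply garbles the source (so a degraded output is the *expected* behavior) does not.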
