Decipherment for Adversarial Offensive Language Detection

2018-10-01 · WS 2018

Zhelun Wu, Nishant Kambhatla, Anoop Sarkar

Abstract

Online services commonly use automated filters to stop users from sending age-inappropriate or bullying messages, or from asking others to reveal personal information. Previous work has relied on rules or classifiers to detect and filter offensive messages, but these are vulnerable to cleverly disguised plaintext and unseen expressions, especially in an adversarial setting where users can repeatedly try to bypass the filter. In this paper, we model disguised messages as if they were produced by encrypting the original message with an invented cipher. We apply automatic decipherment techniques to decode the disguised malicious text, which can then be filtered using rules or classifiers. We provide experimental results on three different datasets and show that decipherment is an effective tool for this task.
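As a rough illustration of the noisy-channel view described above, the sketch below treats each disguised word as the output of a character substitution cipher. It recovers the most likely plaintext with Viterbi decoding under a character bigram language model, then applies a simple blocklist rule to the decoded text. This is not the authors' system. The look-alike table (`LOOKALIKES`), the toy training corpus, the blocklist, and every function name are hypothetical. The channel probabilities are also fixed by hand (uniform over each letter's disguises) rather than estimated from data.

```python
import math
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BOS, EOS = "<", ">"  # word-boundary symbols for the language model

# Hypothetical channel: plaintext letter -> characters it may be disguised as.
LOOKALIKES = {
    "a": "a@4", "e": "e3", "i": "i1!", "l": "l1",
    "o": "o0", "s": "s$5", "t": "t7",
}

# Hypothetical rule-based filter applied after decipherment.
BLOCKLIST = {"idiot"}


def emissions(p):
    """Cipher characters that plaintext character p can produce (itself by default)."""
    return LOOKALIKES.get(p, p)


def train_bigram_lm(corpus):
    """Character bigram LM with add-one smoothing; returns logprob(next, prev)."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in corpus:
        for word in line.lower().split():
            chars = [BOS] + list(word) + [EOS]
            for a, b in zip(chars, chars[1:]):
                counts[a][b] += 1
    vocab_size = len(ALPHABET) + 1  # letters plus EOS

    def logprob(b, a):
        total = sum(counts[a].values())
        return math.log((counts[a][b] + 1) / (total + vocab_size))

    return logprob


def decipher_word(cipher, lm):
    """Viterbi search for the plaintext maximizing P(plain) * P(cipher | plain)."""
    best = {BOS: (0.0, "")}  # last plaintext char -> (log score, plaintext so far)
    for c in cipher:
        # Plaintext letters that could emit c; unknown symbols pass through unchanged.
        cands = [p for p in ALPHABET if c in emissions(p)] or [c]
        new_best = {}
        for p in cands:
            channel = -math.log(len(emissions(p)))  # uniform over p's disguises
            score, text = max((s + lm(p, prev), t) for prev, (s, t) in best.items())
            new_best[p] = (score + channel, text + p)
        best = new_best
    _, text = max((s + lm(EOS, prev), t) for prev, (s, t) in best.items())
    return text


def is_offensive(message, lm):
    """Decipher each token, then apply the blocklist rule to the plaintext."""
    words = [decipher_word(w, lm) for w in message.lower().split()]
    return any(w in BLOCKLIST for w in words)


if __name__ == "__main__":
    lm = train_bigram_lm(["you are an idiot", "what an idiot", "hello world"])
    print(decipher_word("1d10t", lm))          # idiot
    print(is_offensive("u r an 1d10t", lm))    # True
    print(is_offensive("h3ll0 w0rld", lm))     # False
```

The ambiguous symbol `1` (which may stand for `i` or `l`) is resolved by the language model: bigrams such as `di` and `io` are more frequent than `dl` and `lo` in the toy corpus. A realistic system would need a much larger language model and a channel learned from data instead of a hand-written table.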