Benchmarking Post-Hoc Interpretability Approaches for Transformer-based Misogyny Detection

2022-05-01 · nlppower (ACL) 2022 · Code Available

Giuseppe Attanasio, Debora Nozza, Eliana Pastor, Dirk Hovy

Abstract

Transformer-based Natural Language Processing models have become the standard for hate speech detection. However, the unconscious use of these techniques for such a critical task comes with negative consequences. Various works have demonstrated that hate speech classifiers are biased. These findings have prompted efforts to explain classifiers, mainly using attribution methods. In this paper, we provide the first benchmark study of interpretability approaches for hate speech detection. We cover four post-hoc token attribution approaches to explain the predictions of Transformer-based misogyny classifiers in English and Italian. Further, we compare generated attributions to attention analysis. We find that only two algorithms provide faithful explanations aligned with human expectations. Gradient-based methods and attention, however, show inconsistent outputs, making their value for explanations questionable for hate speech detection tasks.
