Probing Toxic Content in Large Pre-Trained Language Models

ACL 2021 · 2021-08-01 · Code Available

Nedjma Ousidhoum, Xinran Zhao, Tianqing Fang, Yangqiu Song, Dit-yan Yeung

Abstract

Large pre-trained language models (PTLMs) have been shown to carry biases towards different social groups, which lead to the reproduction of stereotypical and toxic content by major NLP systems. We propose a method based on logistic regression classifiers to probe English, French, and Arabic PTLMs and quantify the potentially harmful content that they convey with respect to a set of templates. Each template is prompted by the name of a social group followed by a cause-effect relation. We use the PTLMs to predict masked tokens at the end of each sentence in order to examine how likely they are to enable toxicity towards specific communities. We shed light on how such negative content can be triggered within unrelated and benign contexts, based on evidence from a large-scale study, and then explain how to take advantage of our methodology to assess and mitigate the toxicity transmitted by PTLMs.
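The probing recipe described in the abstract can be sketched as follows. This is an illustrative toy, not the authors' released code: the template text, the toy toxicity lexicon, the hand-set logistic-regression weights, and the mocked top-k mask predictions (which a real PTLM such as a masked language model would supply) are all assumptions made for the sake of a self-contained example. The structure, however, mirrors the abstract: a template pairs a social group with a cause-effect relation and a masked slot, the model's completions for that slot are collected with their probabilities, and a logistic-regression classifier scores each completion for toxicity.

```python
import math

# Toy lexicon used as bag-of-words features for the probe (assumption:
# the real classifier uses learned features, not a fixed word list).
TOXIC_LEXICON = ("criminals", "dangerous", "stupid")

def features(token):
    # One binary feature per lexicon word.
    return [1.0 if token == w else 0.0 for w in TOXIC_LEXICON]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hand-set weights and bias standing in for a trained logistic regression.
WEIGHTS, BIAS = [2.0, 2.0, 2.0], -1.0

def toxicity_prob(token):
    # P(toxic | token) under the toy logistic-regression probe.
    z = sum(w * x for w, x in zip(WEIGHTS, features(token))) + BIAS
    return sigmoid(z)

def probe(mask_predictions):
    """Expected toxicity over the PTLM's top-k predictions for the
    masked slot, weighted by the model's own completion probabilities."""
    return sum(p * toxicity_prob(tok) for tok, p in mask_predictions)

# Template: <social group> + cause-effect relation + masked slot.
template = "People from <group> are [MASK] because they moved here."
# Mocked top-k (token, probability) pairs a masked LM might return.
predictions = [("friendly", 0.5), ("dangerous", 0.3), ("new", 0.2)]

score = probe(predictions)
```

A benign completion like "friendly" contributes little to the score, while "dangerous" pushes it up; aggregating over many templates and groups yields the kind of comparative toxicity estimates the paper reports.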
