A Comprehensive Dataset for German Offensive Language and Conversation Analysis
Christoph Demus, Jonas Pitz, Mina Schütz, Nadine Probol, Melanie Siegel, Dirk Labudde
Code: github.com/hdasprachtechnologie/detox (official)
Abstract
In this work, we present a new publicly available offensive language dataset of 10,278 German social media comments, collected in the first half of 2021 and annotated by a total of six annotators. With twelve different annotation categories, it is far more comprehensive than other datasets and goes beyond hate speech detection alone. In particular, the labels also cover toxicity, criminal relevance, and types of discrimination in comments. Furthermore, about half of the comments come from coherent parts of conversations, which makes it possible to take the comments’ contexts into account and to conduct conversation analyses that examine the contagion of offensive language in conversations.