SOTAVerified

Think Outside the Data: Colonial Biases and Systemic Issues in Automated Moderation Pipelines for Low-Resource Languages

2025-01-23Unverified0· sign in to hype

Farhana Shahid, Mona Elswah, Aditya Vashistha

Unverified — Be the first to reproduce this paper.

Reproduce

Abstract

Most social media users come from non-English speaking countries in the Global South, where much of harmful content appears in local languages. Yet, current AI-driven moderation systems struggle with low-resource languages spoken in these regions. This work examines the systemic challenges in building automated moderation tools for these languages. We conducted semi-structured interviews with 22 AI experts working on detecting harmful content in four low-resource languages: Tamil (South Asia), Swahili (East Africa), Maghrebi Arabic (North Africa), and Quechua (South America). Our findings show that beyond the well-known data scarcity in local languages, technical issues--such as outdated machine translation systems, sentiment and toxicity models grounded in Western values, and unreliable language detection technologies--undermine moderation efforts. Even with more data, current language models and preprocessing pipelines--primarily designed for English--struggle with the morphological richness, linguistic complexity, and code-mixing. As a result, automated moderation in Tamil, Swahili, Arabic, and Quechua remains fraught with inaccuracies and blind spots. Based on our findings, we argue that these limitations are not just technical gaps but reflect deeper structural inequities that continue to reproduce historical power imbalances. We conclude by discussing multi-stakeholder approaches to improve automated moderation for low-resource languages.

Tasks

Reproductions