EnTaCs: Analyzing the Relationship Between Sentiment and Language Choice in English-Tamil Code-Switching

2026-03-27

Paul Bontempo

Abstract

This paper investigates the relationship between utterance sentiment and language choice in English-Tamil code-switched text, using methods from machine learning and statistical modelling. We apply a fine-tuned XLM-RoBERTa model for token-level language identification to 35,650 romanized YouTube comments from the DravidianCodeMix dataset, producing per-utterance measurements of English proportion and language switch frequency. Linear regression analysis reveals that positive utterances exhibit a significantly higher proportion of English (34.3%) than negative utterances (24.8%), and that mixed-sentiment utterances show the highest language switch frequency when controlling for utterance length. These findings support the hypothesis that emotional content influences language choice in multilingual code-switching settings, due to sociolinguistic associations of prestige and identity with the embedded and matrix languages.
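As an illustration of the two per-utterance measurements named above, the following minimal sketch derives an English proportion and a switch count from a sequence of token-level language tags. The abstract does not give the exact definitions, so the tag names (`"en"`, `"ta"`), the function name `code_switch_metrics`, the decision to ignore tokens with any other tag, and the use of a raw switch count are all assumptions made for illustration.

```python
from typing import List, Tuple


def code_switch_metrics(
    tags: List[str], english: str = "en", tamil: str = "ta"
) -> Tuple[float, int]:
    """Return (English proportion, number of language switches) for one utterance.

    Assumption: tokens tagged with anything other than the two language
    labels (e.g. punctuation, names) are ignored for both measurements.
    """
    lang_tags = [t for t in tags if t in (english, tamil)]
    if not lang_tags:
        # No language-bearing tokens: both measurements are defined as zero here.
        return 0.0, 0

    # Share of language-bearing tokens that are tagged as English.
    english_proportion = sum(t == english for t in lang_tags) / len(lang_tags)

    # A switch is counted whenever two adjacent language-bearing tokens differ.
    switches = sum(a != b for a, b in zip(lang_tags, lang_tags[1:]))
    return english_proportion, switches


# Hypothetical tag sequence for a six-token comment.
print(code_switch_metrics(["ta", "ta", "en", "en", "ta", "other"]))  # (0.4, 2)
```

Because longer utterances offer more opportunities to switch, a raw count like this would still need to be related to utterance length, for example by including length as a covariate in the regression, as the abstract indicates.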
