Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities
Sina Ahmadi, Antonios Anastasopoulos
Code: github.com/sinaahmadi/scriptnormalization (official)
Abstract
The wide accessibility of social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages. This, however, comes with certain challenges in script normalization, particularly where the speakers of a language in a bilingual community rely on another script or orthography to write their native language. This paper addresses the problem of script normalization for several such languages that are mainly written in a Perso-Arabic script. Using synthetic data with various levels of noise and a transformer-based model, we demonstrate that the problem can be effectively remediated. We also conduct a small-scale evaluation on real data. Our experiments indicate that script normalization also improves the performance of downstream tasks such as machine translation and language identification.
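As a rough illustration of what "synthetic data with various levels of noise" could look like, the sketch below corrupts clean text by swapping characters of the conventional orthography for look-alike characters from a dominant-language script, with a tunable probability. This is not the authors' pipeline: the function names, the `noise_level` parameter, and the four-entry character map (a few Central Kurdish letters mapped to Arabic/Persian counterparts) are assumptions made for the example only.

```python
import random

# Hypothetical character map, assumed for illustration only (not taken from
# the paper): letters of the conventional orthography mapped to the
# dominant-script characters a writer might substitute for them.
ILLUSTRATIVE_MAP = {
    "\u0695": "\u0631",  # ڕ -> ر
    "\u06B5": "\u0644",  # ڵ -> ل
    "\u06C6": "\u0648",  # ۆ -> و
    "\u06CE": "\u06CC",  # ێ -> ی
}


def add_script_noise(text, char_map, noise_level, seed=None):
    """Replace each mappable character with probability `noise_level`."""
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be between 0 and 1")
    rng = random.Random(seed)
    out = []
    for ch in text:
        # Only characters present in the map can be corrupted.
        if ch in char_map and rng.random() < noise_level:
            out.append(char_map[ch])
        else:
            out.append(ch)
    return "".join(out)


def make_pairs(sentences, char_map, noise_level, seed=0):
    """Build (noisy_source, clean_target) pairs for a sequence-to-sequence model."""
    return [
        (add_script_noise(s, char_map, noise_level, seed + i), s)
        for i, s in enumerate(sentences)
    ]
```

Pairs produced this way, with the noisy string as source and the clean string as target, are the kind of input a transformer-based normalization model could be trained on; varying `noise_level` gives datasets of different difficulty.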