SOTAVerified

Lexical Induction of Morphological and Orthographic Forms for Low-Resourced Languages

2020-12-01MSR (COLING) 2020Unverified0· sign in to hype

Taha Tobaili

Unverified — Be the first to reproduce this paper.

Reproduce

Abstract

In this work we address the issue of high-degree lexical sparsity for non-standard languages under severe circumstance of small resources that are considered insufficient to train recent powerful language models. We proposed a new rule-based approach and utilised word embeddings to connect words with their inflectional and orthographic forms from a given corpus. Our case example is the low-resourced Lebanese dialect Arabizi. Arabizi is the name given to a new social transcription of the spoken Arabic in Latin script. The term comes from the portmanteau of Araby (Arabic) and Englizi (English). It is an informal written language where Arabs transcribe their dialectal mother tongue in text using Latin alphanumeral instead of Arabic script. For example حبيبي Ḥabībī my-love could be transcribed as 7abibi in Arabizi. We induced 175K forms from a list of 1.7K sentiment words. We evaluated this induction extrinsically on a sentiment-annotated dataset pushing its coverage by 13% over the previous version. We named the new lexicon SenZi-Large and released it publicly.

Tasks

Reproductions