
Identification of Languages in Algerian Arabic Multilingual Documents

2017-04-01 · WS 2017

Wafia Adouane, Simon Dobnik


Abstract

This paper presents a language identification system designed to detect the language of each word, in its context, in multilingual documents as generated on social media by bilingual/multilingual communities, in our case speakers of Algerian Arabic. We frame the task as a sequence tagging problem and use supervised machine learning with standard methods such as HMM and n-gram classification tagging. We also experiment with a lexicon-based method. Combining all the methods in a fall-back mechanism and introducing some linguistic rules to deal with unseen tokens and ambiguous words gives an overall accuracy of 93.14%. Finally, we introduce rules for language identification from sequences of recognised words.
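To make the fall-back idea concrete, here is a minimal Python sketch of word-level language tagging that tries a lexicon first and backs off to a character n-gram scorer for tokens the lexicon does not cover. It is not the authors' system: the abstract does not state the order of the fall-back steps, so the order here is an assumption, the HMM tagger and the linguistic rules are left out, and the labels (`ENG`, `FRA`, `ALG`), the toy lexicon, the toy training words, and all function names are hypothetical.

```python
from collections import Counter
import math

# Hypothetical toy lexicon: word -> language label (labels are illustrative).
LEXICON = {"the": "ENG", "merci": "FRA", "bzaf": "ALG"}

# Hypothetical toy training words for the character n-gram back-off step.
TRAIN = {
    "ENG": ["hello", "thanks", "good", "morning"],
    "FRA": ["bonjour", "merci", "beaucoup", "matin"],
    "ALG": ["bzaf", "wesh", "rak", "sahbi"],
}


def char_ngrams(word, n=2):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_ngram_model(data, n=2):
    """Count character n-grams per language."""
    model = {}
    for lang, words in data.items():
        counts = Counter()
        for word in words:
            counts.update(char_ngrams(word, n))
        model[lang] = counts
    return model


def ngram_score(word, counts, vocab_size, n=2):
    """Add-one smoothed log-probability of a word under one language's counts."""
    total = sum(counts.values())
    return sum(
        math.log((counts[gram] + 1) / (total + vocab_size))
        for gram in char_ngrams(word, n)
    )


def tag_tokens(tokens, lexicon, model, n=2):
    """Tag each token: lexicon lookup first, n-gram scorer as the fall-back."""
    vocab = set()
    for counts in model.values():
        vocab.update(counts)
    tagged = []
    for token in tokens:
        word = token.lower()
        if word in lexicon:
            # Step 1 (assumed order): trust the lexicon when it covers the word.
            tagged.append((token, lexicon[word], "lexicon"))
        else:
            # Step 2: fall back to the best-scoring language for unseen tokens.
            best = max(
                model,
                key=lambda lang: ngram_score(word, model[lang], len(vocab), n),
            )
            tagged.append((token, best, "ngram"))
    return tagged


model = train_ngram_model(TRAIN)
print(tag_tokens(["merci", "bonjour", "the"], LEXICON, model))
```

Each output triple records which step produced the label, which makes it easy to see how often the fall-back is triggered. A context-aware component such as an HMM would sit in the same chain, but where it belongs in the order is a design decision the abstract does not specify.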
