Weighted combination of BERT and N-GRAM features for Nuanced Arabic Dialect Identification

2020-12-01 · COLING (WANLP) 2020

Abdellah El Mekki, Ahmed Alami, Hamza Alami, Ahmed Khoumsi, Ismail Berrada

Abstract

Across the Arab world, more than 300 million people speak different Arabic dialects, which are increasingly common in social media text. However, Arabic dialects are considered low-resource languages, which limits the development of machine-learning-based systems for them. In this paper, we investigate the Arabic dialect identification task from two perspectives: country-level dialect identification across 21 Arab countries, and province-level dialect identification across 100 provinces. We introduce a unified pipeline of state-of-the-art models that can handle both subtasks. Our experiments on the NADI shared task show promising results at both the country level (F1-score of 25.99%) and the province level (F1-score of 6.39%), ranking us 2nd in the country-level subtask and 1st in the province-level subtask.
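
The title refers to a weighted combination of BERT and n-gram features, but the abstract does not say how the weighting is done. The sketch below shows one common way to do it, assuming each model outputs a class-probability vector and the two are blended with a single weight. The function names, the labels, and the probability values are hypothetical illustrations, not the authors' implementation or data.

```python
from typing import List, Sequence


def combine_probabilities(
    p_bert: Sequence[float], p_ngram: Sequence[float], weight: float
) -> List[float]:
    """Blend two class-probability vectors as a convex combination.

    `weight` is the share given to the BERT-based model; the n-gram
    model receives `1 - weight`. This blending rule is an assumption.
    """
    if len(p_bert) != len(p_ngram):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * b + (1.0 - weight) * n for b, n in zip(p_bert, p_ngram)]


def predict(
    p_bert: Sequence[float],
    p_ngram: Sequence[float],
    labels: Sequence[str],
    weight: float = 0.5,
) -> str:
    """Return the label with the highest blended probability."""
    combined = combine_probabilities(p_bert, p_ngram, weight)
    best = max(range(len(combined)), key=combined.__getitem__)
    return labels[best]


# Made-up scores for a single text over three hypothetical classes.
labels = ["Egypt", "Iraq", "Morocco"]
p_bert = [0.5, 0.3, 0.2]
p_ngram = [0.1, 0.2, 0.7]

print(predict(p_bert, p_ngram, labels, weight=0.8))  # favours the BERT scores
print(predict(p_bert, p_ngram, labels, weight=0.3))  # favours the n-gram scores
```

In a setup like this, the weight would typically be tuned on a held-out development set, separately for the country-level and province-level subtasks.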

Tasks

Dialect Identification
