Arabic dialect identification: An Arabic-BERT model with data augmentation and ensembling strategy

2020-12-01 · COLING (WANLP) 2020

Kamel Gaanoun, Imade Benelallam

Abstract

This paper presents the ArabicProcessors team’s deep learning system designed for the NADI 2020 Subtask 1 (country-level dialect identification) and Subtask 2 (province-level dialect identification). We used Arabic-BERT in combination with data augmentation and ensembling methods. We split the unlabeled data provided by the task organizers (10 million tweets) into multiple subparts, applied a semi-supervised learning method to each, and finally ran a specific ensembling process on the resulting models. The system ranked 3rd in Subtask 1 with an F1-score of 23.26% and 2nd in Subtask 2 with an F1-score of 5.75%.
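
The abstract names neither the semi-supervised method nor the ensembling process, so the following is only a minimal sketch of the general pipeline shape under stated assumptions: confidence-thresholded pseudo-labeling stands in for the semi-supervised step, probability averaging stands in for the ensembling step, and a toy token-count classifier stands in for fine-tuning Arabic-BERT. All function names, the threshold value, and the placeholder data are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Probs = Dict[str, float]
Model = Callable[[str], Probs]


def split_into_parts(items: Sequence[str], n_parts: int) -> List[List[str]]:
    """Split items into exactly n_parts subparts by striding."""
    return [list(items[i::n_parts]) for i in range(n_parts)]


def train(labeled: Sequence[Tuple[str, str]]) -> Model:
    """Toy stand-in for fine-tuning (assumption): per-label token counts."""
    counts: Dict[str, Counter] = {}
    for text, label in labeled:
        counts.setdefault(label, Counter()).update(text.split())

    def predict_proba(text: str) -> Probs:
        # Add-one smoothed score per label, normalized to sum to 1.
        scores = {
            label: 1.0 + sum(c[tok] for tok in text.split())
            for label, c in counts.items()
        }
        total = sum(scores.values())
        return {label: s / total for label, s in scores.items()}

    return predict_proba


def pseudo_label(
    model: Model, unlabeled: Sequence[str], threshold: float
) -> List[Tuple[str, str]]:
    """Keep only the unlabeled texts the model labels with high confidence."""
    kept = []
    for text in unlabeled:
        label, prob = max(model(text).items(), key=lambda kv: kv[1])
        if prob >= threshold:
            kept.append((text, label))
    return kept


def ensemble_predict(models: Sequence[Model], text: str) -> str:
    """Average class probabilities across models and return the top label."""
    avg: Counter = Counter()
    for model in models:
        for label, prob in model(text).items():
            avg[label] += prob / len(models)
    return max(avg, key=avg.get)


# Placeholder data: real inputs would be Arabic tweets with dialect labels.
labeled = [("a a b", "X"), ("c c d", "Y")]
unlabeled = ["a a a", "c c c", "a b", "c d", "e"]

base = train(labeled)
models = []
for part in split_into_parts(unlabeled, n_parts=2):
    # One model per subpart: gold labels plus that subpart's pseudo-labels.
    augmented = list(labeled) + pseudo_label(base, part, threshold=0.7)
    models.append(train(augmented))

print(ensemble_predict(models, "a a"))
```

Training one model per subpart keeps each run small and gives the ensemble members different training data, which is one plausible reason to split a large unlabeled pool. Whether the authors' models were combined this way is not stated in the abstract.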

Tasks

Data Augmentation, Dialect Identification
