N-gram and Neural Models for Uralic Language Identification: NRC at VarDial 2021

2021-04-01 · EACL (VarDial) 2021

Gabriel Bernier-Colborne, Serge Leger, Cyril Goutte

Abstract

We describe the systems developed by the National Research Council Canada for the Uralic language identification shared task at the 2021 VarDial evaluation campaign. We evaluated two different approaches to this task: a probabilistic classifier exploiting only character 5-grams as features, and a character-based neural network pre-trained through self-supervision, then fine-tuned on the language identification task. The former method turned out to perform better, which casts doubt on the usefulness of deep learning methods for language identification, where they have yet to convincingly and consistently outperform simpler and less costly classification algorithms exploiting n-gram features.
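The first approach in the abstract, a probabilistic classifier over character 5-grams, can be illustrated with a minimal sketch. The abstract does not say which probabilistic model was used, so this example assumes multinomial Naive Bayes with add-one smoothing. The class name `CharNgramNB`, the space padding scheme, and the four toy training sentences are all hypothetical and are not taken from the paper or the shared-task data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=5):
    """Return the overlapping character n-grams of a space-padded string."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramNB:
    """Multinomial Naive Bayes over character n-grams (illustrative assumption)."""

    def __init__(self, n=5, alpha=1.0):
        self.n = n          # n-gram length; 5 matches the abstract
        self.alpha = alpha  # additive smoothing constant (assumed)

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.doc_counts = Counter(labels)    # label -> number of training texts
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text, self.n))
        self.vocab = {g for c in self.counts.values() for g in c}
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / n_docs)
            denom = self.totals[label] + self.alpha * vocab_size
            for gram in char_ngrams(text, self.n):
                if gram in self.vocab:  # n-grams unseen in training are skipped
                    score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data for illustration only (not the VarDial 2021 corpus).
train_texts = [
    "minä olen opiskelija",
    "hyvää huomenta kaikille",
    "mina olen üliõpilane",
    "tere hommikust kõigile",
]
train_labels = ["fi", "fi", "et", "et"]

model = CharNgramNB(n=5).fit(train_texts, train_labels)
```

Calling `model.predict(text)` returns the label whose smoothed 5-gram statistics best match `text`. A model of this kind needs only counting and a table lookup at prediction time, which is the sense in which the abstract calls n-gram classifiers simpler and less costly than a pre-trained neural network.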

Tasks

Language Identification
