Discriminating between Similar Languages using Weighted Subword Features

2017-04-01 · WS 2017

Adrien Barbaresi

Code

github.com/adbar/vardial-experiments
Official implementation (linked in the paper)

Abstract

This contribution presents a contrastive subword n-gram model that was tested in the Discriminating between Similar Languages (DSL) shared task. I present and discuss the method used in this 14-way language identification task comprising varieties of 6 main language groups. The method has three characteristics: (1) preprocessing and conversion of a collection of documents to sparse features; (2) weighted character n-gram profiles; (3) a multinomial Bayesian classifier. Meaningful bag-of-n-grams features can be used as a system in a straightforward way, and my approach outperforms most of the systems used in the DSL shared task (3rd rank).
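
The three components named in the abstract can be illustrated with a minimal, standard-library sketch. It is not the author's implementation (see the linked repository for that). The n-gram range (1 to 3), the IDF weighting scheme, the smoothing value, and the names `char_ngrams` and `NgramNaiveBayes` are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams for n in [n_min, n_max] (range is an assumption)."""
    padded = f" {text.lower()} "  # pad so word boundaries become features
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


class NgramNaiveBayes:
    """Multinomial naive Bayes over IDF-weighted character n-gram counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing; hypothetical default

    def fit(self, texts, labels):
        profiles = [char_ngrams(t) for t in texts]
        self.n_docs = len(profiles)
        doc_freq = Counter()
        for profile in profiles:
            doc_freq.update(profile.keys())
        # Smoothed inverse document frequency, used here as the weighting (assumption)
        self.idf = {
            g: math.log((1 + self.n_docs) / (1 + df)) + 1.0
            for g, df in doc_freq.items()
        }
        self.class_counts = Counter(labels)
        self.feature_weights = defaultdict(Counter)
        for profile, label in zip(profiles, labels):
            for g, count in profile.items():
                self.feature_weights[label][g] += count * self.idf[g]
        self.class_totals = {
            label: sum(w.values()) for label, w in self.feature_weights.items()
        }
        return self

    def predict(self, text):
        profile = char_ngrams(text)
        vocab_size = len(self.idf)
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            score = math.log(n_label / self.n_docs)  # class prior
            denom = self.class_totals[label] + self.alpha * vocab_size
            for g, count in profile.items():
                if g not in self.idf:
                    continue  # ignore n-grams never seen in training
                weight = count * self.idf[g]
                likelihood = (self.feature_weights[label][g] + self.alpha) / denom
                score += weight * math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data for illustration only; not drawn from the DSL shared task corpus.
texts = [
    "the cat sat on the mat",
    "this is the house",
    "el gato está en la casa",
    "esta es la casa",
]
labels = ["en", "en", "es", "es"]
model = NgramNaiveBayes().fit(texts, labels)
print(model.predict("the house"), model.predict("la casa"))
```

In practice the same pipeline is usually assembled from an existing sparse vectorizer and classifier rather than written by hand; the hand-rolled version only makes the three steps visible.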

Tasks

Language Identification, Text Categorization
