SOTAVerified

Discriminating between Similar Languages using Weighted Subword Features

2017-04-01WS 2017Code Available0· sign in to hype

Adrien Barbaresi

Code Available — Be the first to reproduce this paper.

Reproduce

Code

Abstract

The present contribution revolves around a contrastive subword n-gram model which has been tested in the Discriminating between Similar Languages shared task. I present and discuss the method used in this 14-way language identification task comprising varieties of 6 main language groups. It features the following characteristics: (1) the preprocessing and conversion of a collection of documents to sparse features; (2) weighted character n-gram profiles; (3) a multinomial Bayesian classifier. Meaningful bag-of-n-grams features can be used as a system in a straightforward way, my approach outperforms most of the systems used in the DSL shared task (3rd rank).

Tasks

Reproductions