Vanilla Classifiers for Distinguishing between Similar Languages

2016-12-01 · WS 2016

Sergiu Nisioi, Alina Maria Ciobanu, Liviu P. Dinu

Abstract

In this paper we describe the submission of the UniBuc-NLP team for the Discriminating between Similar Languages Shared Task, DSL 2016. We present and analyze the results we obtained in the closed track of sub-task 1 (Similar languages and language varieties) and sub-task 2 (Arabic dialects). For sub-task 1 we used a logistic regression classifier with tf-idf feature weighting and for sub-task 2 a character-based string kernel with an SVM classifier. Our results show that good accuracy scores can be obtained with limited feature and model engineering. While certain limitations are to be acknowledged, our approach worked surprisingly well for out-of-domain, social media data, with 0.898 accuracy (3rd place) for dataset B1 and 0.838 accuracy (4th place) for dataset B2.
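To make the sub-task 2 approach more concrete, here is a minimal sketch of a character-based string kernel. The abstract does not say which kernel variant the authors used, so a normalized character n-gram spectrum kernel is assumed here. The function names and the n-gram range (2 to 4) are hypothetical choices for illustration, not the paper's settings.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n_min=2, n_max=4):
    """Count all character n-grams of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def spectrum_kernel(a, b):
    """Unnormalized kernel: dot product of two n-gram count profiles."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller profile
    return sum(count * b.get(gram, 0) for gram, count in a.items())


def gram_matrix(texts, n_min=2, n_max=4):
    """Normalized kernel matrix: K[i][j] = k(i, j) / sqrt(k(i, i) * k(j, j))."""
    profiles = [char_ngrams(t, n_min, n_max) for t in texts]
    self_sim = [spectrum_kernel(p, p) for p in profiles]
    size = len(texts)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i, size):
            denom = sqrt(self_sim[i] * self_sim[j])
            # Texts shorter than n_min have an empty profile; treat as 0 similarity.
            value = spectrum_kernel(profiles[i], profiles[j]) / denom if denom else 0.0
            matrix[i][j] = matrix[j][i] = value
    return matrix
```

A matrix built this way could be passed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`. That pairing is one plausible way to reproduce the setup, not a description of the authors' code.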

Tasks

Information Retrieval, Language Identification, Question Answering, regression, Task 2
