On the Development of a Large Scale Corpus for Native Language Identification

2018-12-10 · TLT17 2018

Thomas Hudson, Sardar Jaf

Code

github.com/ghomasHudson/italkiCorpus

Abstract

Native Language Identification (NLI) is the task of identifying an author’s native language from their writings in a second language. In this paper, we introduce a new corpus (italki), which is larger than the current corpora. It can be used to train machine-learning-based systems that identify the native language of authors of English text. To examine the usefulness of italki, we use it to train and test some of the well-performing NLI systems presented in the 2017 NLI shared task. We present some aspects of italki and show how varying the size of its training data for some languages affects system performance. From our empirical findings, we highlight the potential of italki as a large-scale corpus for training machine learning classifiers that identify the native language of authors from their written English text. We obtained promising results that show the potential of italki to improve the performance of current NLI systems. More importantly, we found that the current NLI systems generalize better when trained on italki than when trained on the current corpora.

Tasks

BIG-bench Machine Learning, Language Identification, Native Language Identification
