New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian

2016-05-01 · LREC 2016

Nikola Ljubešić, Filip Klubička, Željko Agić, Ivo-Pavao Jazbec

Abstract

In this paper we present newly developed inflectional lexicons and manually annotated corpora of Croatian and Serbian. We introduce hrLex and srLex, two freely available inflectional lexicons of Croatian and Serbian, and describe the process of building these lexicons, supported by supervised machine learning techniques for lemma and paradigm prediction. Furthermore, we introduce hr500k, a manually annotated corpus of Croatian, 500 thousand tokens in size. We showcase the three newly developed resources on the task of morphosyntactic annotation of both languages by using a recently developed CRF tagger. We achieve the best results yet reported on the task for both languages, beating the HunPos baseline trained on the same datasets by a wide margin.

Tasks

LEMMA
