Data-Driven Morphological Analysis and Disambiguation for Morphologically Rich Languages and Universal Dependencies

2016-12-01 · COLING 2016

Amir More, Reut Tsarfaty

Abstract

Parsing texts into universal dependencies (UD) in realistic scenarios requires infrastructure for the morphological analysis and disambiguation (MA&D) of typologically different languages as a first tier. MA&D is particularly challenging in morphologically rich languages (MRLs), where the ambiguous space-delimited tokens ought to be disambiguated with respect to their constituent morphemes, each morpheme carrying its own tag and a rich set of features. Here we present a novel, language-agnostic framework for MA&D, based on a transition system with two variants (word-based and morpheme-based) and a dedicated transition to mitigate the biases of variable-length morpheme sequences. Our experiments on a Modern Hebrew case study show state-of-the-art results, and we show that the morpheme-based MD consistently outperforms our word-based variant. We further illustrate the utility and multilingual coverage of our framework by morphologically analyzing and disambiguating the large set of languages in the UD treebanks.
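To make the variable-length issue concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes each space-delimited token comes with a list of alternative analyses, each a sequence of (form, tag) morphemes. The name `LATTICE`, the helper `candidate_paths`, the transliterated forms, and the tags are all illustrative assumptions and are not taken from the paper's data or code.

```python
from itertools import product

# Hypothetical ambiguity lattice: one entry per space-delimited token,
# each entry listing alternative analyses as (form, tag) morpheme sequences.
# Forms and tags are illustrative only.
LATTICE = [
    [  # token 1: two analyses with different numbers of morphemes
        [("b", "ADP"), ("bit", "NOUN")],
        [("b", "ADP"), ("h", "DET"), ("bit", "NOUN")],
    ],
    [  # token 2: a single, unambiguous analysis
        [("gdwl", "ADJ")],
    ],
]


def candidate_paths(lattice):
    """Yield every sentence-level morpheme sequence the lattice allows."""
    for choice in product(*lattice):
        # Flatten the chosen per-token analyses into one morpheme sequence.
        yield [morpheme for analysis in choice for morpheme in analysis]


paths = list(candidate_paths(LATTICE))
lengths = sorted(len(path) for path in paths)

for path in paths:
    print(len(path), path)
```

The two candidates for the same two-token input have different morpheme counts. A disambiguator that scores one decision per morpheme therefore compares sequences built from different numbers of decisions, which is the kind of length bias the abstract says the dedicated transition is meant to mitigate.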
