Splitting compounds with ngrams

2016-12-01 · COLING 2016

Naomi Tachikawa Shapiro

Abstract

Compound words with unmarked word boundaries are problematic for many tasks in NLP and computational linguistics, including information extraction, machine translation, and syllabification. This paper introduces a simple, proof-of-concept language modeling approach to automatic compound segmentation, as applied to Finnish. This approach utilizes an off-the-shelf morphological analyzer to split training words into their constituent morphemes. A language model is subsequently trained on ngrams composed of morphemes, morpheme boundaries, and word boundaries. Linguistic constraints are then used to weed out phonotactically ill-formed segmentations, thereby allowing the language model to select the best grammatical segmentation. This approach achieves an accuracy of 97%.
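
The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes a bigram model with add-one smoothing (the abstract does not state the ngram order or smoothing), hand-written morpheme lists in place of the morphological analyzer's output, and a single toy phonotactic constraint (a word may not begin with two consonants) standing in for the paper's linguistic constraints. The boundary symbols `#` and `+` and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

WORD = "#"   # word boundary token (assumed notation)
MORPH = "+"  # morpheme boundary token (assumed notation)
VOWELS = set("aeiouyäö")

def to_tokens(words):
    """Flatten a segmentation (list of words, each a list of morphemes)
    into morphemes interleaved with boundary tokens."""
    tokens = [WORD]
    for word in words:
        for i, morpheme in enumerate(word):
            if i > 0:
                tokens.append(MORPH)
            tokens.append(morpheme)
        tokens.append(WORD)
    return tokens

class BigramModel:
    """Bigram model over morpheme/boundary tokens with add-one smoothing."""
    def __init__(self, training):
        self.context = Counter()  # how often each token occurs as a left context
        self.bigrams = Counter()
        vocab = set()
        for words in training:
            tokens = to_tokens(words)
            vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.context[prev] += 1
                self.bigrams[(prev, cur)] += 1
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen tokens

    def log_prob(self, words):
        tokens = to_tokens(words)
        return sum(
            math.log((self.bigrams[(p, c)] + 1) / (self.context[p] + self.vocab_size))
            for p, c in zip(tokens, tokens[1:])
        )

def candidates(morphemes):
    """Label every internal boundary as a morpheme or a word boundary."""
    for marks in product([False, True], repeat=len(morphemes) - 1):
        words = [[morphemes[0]]]
        for morpheme, is_word_boundary in zip(morphemes[1:], marks):
            if is_word_boundary:
                words.append([morpheme])
            else:
                words[-1].append(morpheme)
        yield words

def well_formed(words):
    """Toy constraint (an assumption): no word starts with two consonants."""
    for word in words:
        surface = "".join(word)
        if len(surface) >= 2 and surface[0] not in VOWELS and surface[1] not in VOWELS:
            return False
    return True

def segment(model, morphemes):
    """Pick the highest-scoring candidate that passes the constraint."""
    everything = list(candidates(morphemes))
    # Fall back to all candidates if the constraint rejects every one.
    viable = [c for c in everything if well_formed(c)] or everything
    return max(viable, key=model.log_prob)

# Hand-made toy training data, not taken from the paper.
training = [
    [["kirja"], ["kauppa"]],  # kirjakauppa
    [["auto"], ["talli"]],    # autotalli
    [["kirja", "ssa"]],       # kirjassa
    [["auto", "ssa"]],        # autossa
    [["talo", "ssa"]],        # talossa
]
model = BigramModel(training)

print(segment(model, ["auto", "kauppa"]))  # [['auto'], ['kauppa']]
print(segment(model, ["talo", "ssa"]))     # [['talo', 'ssa']]
```

In this sketch the constraint prunes candidates before scoring, so the model only ranks segmentations that are already phonotactically acceptable, which mirrors the order of steps given in the abstract.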
