HMMs for Unsupervised Vietnamese Word Segmentation

2019-05-16

Ba-Long Bui, Thi-Trang Nguyen, Huu-Hoang Nguyen, Kiem-Hieu Nguyen

Abstract

Word segmentation is an important problem in natural language processing. Most previous work on Vietnamese word segmentation relies on supervised learning. In this paper, we propose an unsupervised method for Vietnamese word segmentation based on Hidden Markov Models. We naturally encode prior linguistic knowledge into model learning. In decoding, we propose an enhancement of the Viterbi decoding algorithm with external token ordering statistics from Pointwise Mutual Information. Evaluation on benchmark datasets shows that the proposed method works reasonably well. Source code is available at https://github.com/longbb/wordrecognition
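As a rough illustration of the Pointwise Mutual Information statistic mentioned in the abstract, the sketch below computes PMI for adjacent token pairs from raw counts. It is not taken from the authors' repository: the function name `pmi_table`, the toy corpus, and the choice of base-2 logarithm with unsmoothed maximum-likelihood estimates are all assumptions made for this example. The abstract does not say how these scores enter Viterbi decoding, so that step is left out.

```python
import math
from collections import Counter


def pmi_table(sentences):
    """Return PMI(a, b) = log2(P(a, b) / (P(a) * P(b))) for adjacent pairs.

    `sentences` is a list of token lists (for Vietnamese, typically
    whitespace-separated syllables). Probabilities are plain relative
    frequencies with no smoothing (an assumption of this sketch).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


# Hypothetical toy corpus, already split into syllables.
corpus = [
    ["học", "sinh", "đi", "học"],
    ["sinh", "viên", "đi", "làm"],
    ["học", "sinh", "làm", "bài"],
]
pmi = pmi_table(corpus)

# Pairs that co-occur more often than chance get higher scores.
for pair, score in sorted(pmi.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this toy corpus the pair ("học", "sinh") scores higher than ("sinh", "đi"), which matches the intuition that syllables belonging to the same word tend to co-occur more strongly than syllables that straddle a word boundary.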
