
Enhanced Urdu Word Segmentation using Conditional Random Fields and Morphological Context Features

2020-07-01 · WS 2020

Aamir Farhan, Mashrukh Islam, Dipti Misra Sharma


Abstract

Word segmentation is a fundamental task for most NLP applications. Urdu is written in the Nastalique style, which has no concept of space. Furthermore, certain Urdu characters are inherently non-joining, which introduces spaces within a word when text is written in digital format. Urdu therefore suffers from both space omission and space insertion, which makes word segmentation challenging. In this paper, we improve upon the results of Zia, Raza and Athar (2018) by using a manually annotated corpus of 19,651 sentences along with morphological context features. Using a Conditional Random Field sequence model, our approach achieves an F1 score of 0.98 for word boundary identification and 0.92 for sub-word boundary identification. These results outperform the state-of-the-art methods.
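To make the sequence-labeling framing concrete, the following is a minimal sketch of how segmented text could be turned into per-character boundary labels and context-window features of the kind a CRF toolkit typically consumes, together with a boundary-level F1 computation. The abstract does not specify the label scheme, the feature templates, or the evaluation protocol, so the B/I tags, the window size, and the function names `to_char_labels`, `char_features`, and `boundary_f1` are all illustrative assumptions rather than the authors' method. The morphological context features mentioned above are not modeled here.

```python
from typing import Dict, List, Tuple


def to_char_labels(words: List[str]) -> Tuple[List[str], List[str]]:
    """Flatten gold-segmented words into characters with B/I labels.

    Assumed scheme: "B" marks the first character of a word,
    "I" marks every other character. Spaces are dropped because
    they are unreliable cues in the input.
    """
    chars: List[str] = []
    labels: List[str] = []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels


def char_features(chars: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Build a feature dict for position i from neighbouring characters.

    The window size is a hypothetical choice; positions outside the
    sequence are filled with a padding symbol.
    """
    feats = {"bias": "1", "char": chars[i]}
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        feats[f"char[{off:+d}]"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    return feats


def boundary_f1(gold: List[str], pred: List[str]) -> float:
    """F1 over predicted word-initial ("B") positions."""
    tp = sum(1 for g, p in zip(gold, pred) if g == "B" and p == "B")
    n_pred = sum(1 for p in pred if p == "B")
    n_gold = sum(1 for g in gold if g == "B")
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Placeholder Latin tokens stand in for Urdu words.
chars, labels = to_char_labels(["ab", "cde"])
features = [char_features(chars, i) for i in range(len(chars))]
```

The resulting `features` and `labels` lists form one training sequence. A sub-word boundary task could reuse the same pipeline with a different gold segmentation.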
