Towards a Standardized Dataset on Indonesian Named Entity Recognition

2020-12-01 · Asian Chapter of the Association for Computational Linguistics · Code Available

Siti Oryza Khairunnisa, Aizhan Imankulova, Mamoru Komachi

Abstract

In recent years, named entity recognition (NER) tasks in the Indonesian language have undergone extensive development. There are only a few corpora for Indonesian NER; hence, recent Indonesian NER studies have used diverse datasets. Although an open dataset is available, it includes only approximately 2,000 sentences and contains inconsistent annotations, thereby preventing accurate training of NER models without reliance on pre-trained models. Therefore, we re-annotated the dataset and compared the two annotations' performance using the Bidirectional Long Short-Term Memory and Conditional Random Field (BiLSTM-CRF) approach. Fixing the annotation yielded a more consistent result for the organization tag and improved the prediction score by a large margin. Moreover, to take full advantage of pre-trained models, we compared different feature embeddings to determine their impact on the NER task for the Indonesian language.
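The kind of annotation inconsistency described in the abstract can be illustrated with a small check that flags entity mentions labeled with more than one type across a corpus. This is only a minimal sketch, not the authors' re-annotation procedure: it assumes BIO-style tags, the label names `ORG` and `LOC` are hypothetical, the function names are invented for illustration, and the toy sentences are made up rather than taken from the dataset.

```python
from collections import defaultdict


def extract_entities(tokens, tags):
    """Return (surface_text, entity_type) pairs from one BIO-tagged sentence."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and current_type != tag[2:]
        )
        if starts_new:
            # Close any open entity, then start a new one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continue the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O" tag: close any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


def find_inconsistent_mentions(corpus):
    """Map each lowercased mention to its types, keeping only mentions with 2+ types."""
    types_by_surface = defaultdict(set)
    for tokens, tags in corpus:
        for surface, entity_type in extract_entities(tokens, tags):
            types_by_surface[surface.lower()].add(entity_type)
    return {
        surface: sorted(types)
        for surface, types in types_by_surface.items()
        if len(types) > 1
    }


# Toy corpus (invented for illustration; labels ORG/LOC are assumed names).
toy_corpus = [
    (["Bank", "Indonesia", "menaikkan", "suku", "bunga"],
     ["B-ORG", "I-ORG", "O", "O", "O"]),
    (["Gubernur", "Bank", "Indonesia", "tiba", "di", "Jakarta"],
     ["O", "B-LOC", "I-LOC", "O", "O", "B-LOC"]),
    (["Jakarta", "macet"],
     ["B-LOC", "O"]),
]

inconsistent = find_inconsistent_mentions(toy_corpus)
print(inconsistent)  # {'bank indonesia': ['LOC', 'ORG']}
```

A mention flagged this way is only a candidate for review, since some surface forms are legitimately ambiguous between types depending on context.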
