Subword-level Word Vector Representations for Korean

2018-07-01 · ACL 2018

Sungjoon Park, Jeongmin Byun, Sion Baek, Yongseok Cho, Alice Oh


Code

github.com/SungjoonPark/KoreanWordVectors
Official implementation, linked in the paper

Abstract

Research on distributed word representations is focused on widely-used languages such as English. Although the same methods can be used for other languages, language-specific knowledge can enhance the accuracy and richness of word vector representations. In this paper, we look at improving distributed word representations for Korean using knowledge about the unique linguistic structure of Korean. Specifically, we decompose Korean words into the jamo-level, beyond the character-level, allowing a systematic use of subword information. To evaluate the vectors, we develop Korean test sets for word similarity and analogy and make them publicly available. The results show that our simple method outperforms word2vec and character-level Skip-Grams on semantic and syntactic similarity and analogy tasks and contributes positively toward downstream NLP tasks such as sentiment analysis.
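The jamo-level decomposition mentioned in the abstract can be illustrated with the standard Unicode arithmetic for precomposed Hangul syllables. The following is a minimal sketch, not the authors' implementation: the function names `decompose_syllable` and `decompose_word` are hypothetical, and how the paper handles syllables without a final consonant or how it builds subword n-grams from the jamo sequence is not stated here, so neither is modeled.

```python
# Minimal sketch (assumption: not the paper's code) of splitting precomposed
# Hangul syllables into conjoining jamo using Unicode code point arithmetic.

S_BASE = 0xAC00   # first precomposed Hangul syllable (U+AC00)
L_BASE = 0x1100   # first initial consonant (choseong)
V_BASE = 0x1161   # first medial vowel (jungseong)
T_BASE = 0x11A7   # one before the first final consonant (jongseong)
V_COUNT = 21      # number of medial vowels
T_COUNT = 28      # number of final slots, including "no final"
S_COUNT = 19 * V_COUNT * T_COUNT  # 11172 precomposed syllables


def decompose_syllable(ch):
    """Return the jamo of one Hangul syllable, or [ch] for any other character."""
    index = ord(ch) - S_BASE
    if not 0 <= index < S_COUNT:
        return [ch]  # not a precomposed Hangul syllable; pass through unchanged
    initial = index // (V_COUNT * T_COUNT)
    medial = (index % (V_COUNT * T_COUNT)) // T_COUNT
    final = index % T_COUNT
    jamo = [chr(L_BASE + initial), chr(V_BASE + medial)]
    if final:  # final == 0 means the syllable has no final consonant
        jamo.append(chr(T_BASE + final))
    return jamo


def decompose_word(word):
    """Flatten a word into its jamo sequence, syllable by syllable."""
    return [j for ch in word for j in decompose_syllable(ch)]


if __name__ == "__main__":
    print(decompose_word("한국"))
```

For precomposed syllables this yields the same code points as Unicode canonical decomposition (`unicodedata.normalize("NFD", ...)` in Python); writing the arithmetic out makes the initial, medial, and final positions explicit.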

Tasks

Document Classification, Language Modelling, Machine Translation, Sentiment Analysis, Text Classification, Word Similarity
