UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database

2020-12-01 · Asian Chapter of the Association for Computational Linguistics

Canwen Xu, Tao Ge, Chenliang Li, Furu Wei

Code

github.com/jetrunner/unihan-lm
Official implementation, linked in the paper

Abstract

Chinese and Japanese share many characters with similar surface morphology. To better utilize the shared knowledge across the two languages, we propose UnihanLM, a self-supervised Chinese-Japanese pretrained masked language model (MLM) with a novel two-stage coarse-to-fine training approach. We exploit Unihan, a ready-made database constructed by linguistic experts, to first merge morphologically similar characters into clusters. The resulting clusters replace the original characters in sentences for the coarse-grained pretraining of the MLM. We then restore the clusters to the original characters for the fine-grained pretraining, which learns the representations of the specific characters. We conduct extensive experiments on a variety of Chinese and Japanese NLP benchmarks, showing that UnihanLM is effective on both mono- and cross-lingual Chinese and Japanese tasks, shedding light on a new path to exploit the homology of languages.
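The coarse-to-fine data preparation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the variant pairs below are a hand-written toy table standing in for variant links taken from Unihan, and the function names `build_clusters` and `to_coarse` are hypothetical. The sketch only shows how text might be mapped to cluster representatives for the coarse stage; the MLM training itself is omitted.

```python
from typing import Dict, Iterable, Tuple


def build_clusters(variant_pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Merge characters linked as variants into clusters (union-find).

    Returns a mapping from each character to its cluster representative.
    """
    parent: Dict[str, str] = {}

    def find(ch: str) -> str:
        parent.setdefault(ch, ch)
        while parent[ch] != ch:
            parent[ch] = parent[parent[ch]]  # path halving
            ch = parent[ch]
        return ch

    for a, b in variant_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Assumption: use the lower code point as the representative,
            # only so that the result is deterministic.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    return {ch: find(ch) for ch in list(parent)}


def to_coarse(sentence: str, clusters: Dict[str, str]) -> str:
    """Replace every character by its cluster representative (coarse view)."""
    return "".join(clusters.get(ch, ch) for ch in sentence)


# Toy variant table (assumption): simplified, traditional and Japanese forms.
TOY_PAIRS = [("发", "發"), ("發", "発"), ("学", "學")]
clusters = build_clusters(TOY_PAIRS)

japanese = "出発"      # fine-grained (original) Japanese text
traditional = "出發"   # fine-grained (original) Traditional Chinese text

# Coarse stage: both sentences collapse to the same cluster sequence.
coarse_ja = to_coarse(japanese, clusters)
coarse_zh = to_coarse(traditional, clusters)
print(coarse_ja, coarse_zh)

# Fine stage: training continues on the original, unmapped sentences.
fine_corpus = [japanese, traditional]
```

In this toy setup, the Japanese and Traditional Chinese spellings share one coarse form, which is the kind of cross-lingual sharing the coarse stage is meant to exploit, while the fine stage sees the original characters again.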

Tasks

Language Modeling
