CNMBERT: A Model for Converting Hanyu Pinyin Abbreviations to Chinese Characters

2024-11-18

Zishuo Feng, Feng Cao

Code

github.com/igarashiakatuki/cnmbert
Official, in paper, PyTorch, ★ 133

Abstract

The task of converting Hanyu Pinyin abbreviations to Chinese characters is a significant branch within the domain of Chinese Spelling Correction (CSC). It plays an important role in many downstream applications such as named entity recognition and sentiment analysis. This task typically involves text-length alignment and seems easy to solve; however, due to the limited information content in pinyin abbreviations, achieving accurate conversion is challenging. In this paper, we treat this as a fill-mask task and propose CNMBERT, which stands for zh-CN Pinyin Multi-mask BERT Model, as a solution to this issue. By introducing a multi-mask strategy and Mixture of Experts (MoE) layers, CNMBERT outperforms fine-tuned GPT models and ChatGPT-4o with a 61.53% MRR score and 51.86% accuracy on a 10,373-sample test dataset.

Tasks

Fill Mask, Mixture-of-Experts, Named Entity Recognition, Sentiment Analysis, Spelling Correction
