EdaCSC: Two Easy Data Augmentation Methods for Chinese Spelling Correction

2024-09-08

Lei Sheng, Shuai-Shuai Xu

Code

github.com/cycloneboy/csc_eda
Official implementation (linked in the paper), PyTorch

Abstract

Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in Chinese sentences caused by phonetic or visual similarities. While current CSC models integrate pinyin or glyph features and have shown significant progress, they still face challenges when dealing with sentences containing multiple typos and are susceptible to overcorrection in real-world scenarios. In contrast to existing model-centric approaches, we propose two data augmentation methods to address these limitations. First, we augment the dataset by either splitting long sentences into shorter ones or reducing typos in sentences with multiple typos. Subsequently, we employ different training processes to select the optimal model. Experimental evaluations on the SIGHAN benchmarks demonstrate the superiority of our approach over most existing models, achieving state-of-the-art performance on the SIGHAN15 test set.
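
The two augmentations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository listed under Code for that); it assumes each training example is an aligned pair of equal-length strings (erroneous source, corrected target), as in the SIGHAN data. The function names `split_pair` and `reduce_typos`, the punctuation set, and the random choice of which typos to keep are all assumptions made for illustration.

```python
import random

# Assumed set of sentence-internal split points; the paper may use a different rule.
PUNCT = "，。！？；"


def split_pair(src, tgt, puncts=PUNCT):
    """Split an aligned (src, tgt) pair into shorter aligned pairs.

    Cuts are made after each punctuation mark found in the target, so the
    source and target pieces stay character-aligned.
    """
    assert len(src) == len(tgt), "CSC pairs are assumed to be equal length"
    pieces, start = [], 0
    for i, ch in enumerate(tgt):
        if ch in puncts:
            pieces.append((src[start:i + 1], tgt[start:i + 1]))
            start = i + 1
    if start < len(tgt):  # trailing text without closing punctuation
        pieces.append((src[start:], tgt[start:]))
    return pieces


def reduce_typos(src, tgt, keep=1, rng=None):
    """Return a new source string that keeps at most `keep` typos.

    All other typo positions are replaced by the correct target character.
    Which typos survive is chosen at random (an assumption of this sketch).
    """
    assert len(src) == len(tgt), "CSC pairs are assumed to be equal length"
    rng = rng or random.Random(0)
    typo_pos = [i for i, (s, t) in enumerate(zip(src, tgt)) if s != t]
    if len(typo_pos) <= keep:
        return src  # nothing to reduce
    kept = set(rng.sample(typo_pos, keep))
    return "".join(
        s if (s == t or i in kept) else t
        for i, (s, t) in enumerate(zip(src, tgt))
    )


if __name__ == "__main__":
    src = "我今天很高心，想去公圆玩。"  # two typos: 心 (for 兴) and 圆 (for 园)
    tgt = "我今天很高兴，想去公园玩。"
    for piece in split_pair(src, tgt):
        print(piece)
    print(reduce_typos(src, tgt, keep=1), tgt)
```

On the example pair, `split_pair` produces two shorter pairs with one typo each, and `reduce_typos` produces a variant of the full sentence with a single remaining typo. Either output would be added to the training set alongside the original pair.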

Tasks

Data Augmentation, Spelling Correction
