Data Strategies for Low-Resource Grammatical Error Correction

2021-04-01 · EACL (BEA) 2021

Simon Flachs, Felix Stahlberg, Shankar Kumar

Abstract

Grammatical Error Correction (GEC) is a task that has been extensively investigated for the English language. However, for low-resource languages, the best practices for training GEC systems have not yet been systematically determined. We investigate how best to take advantage of existing data sources for improving GEC systems for languages with limited quantities of high-quality training data. We show that methods for generating artificial training data for GEC can benefit from including morphological errors. We also demonstrate that noisy error correction data gathered from Wikipedia revision histories and the language-learning website Lang8 are valuable data sources. Finally, we show that GEC systems pre-trained on noisy data sources can be fine-tuned effectively using small amounts of high-quality, human-annotated data.
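
The abstract does not spell out how the artificial training data is generated, so the following is only a minimal sketch of the general idea: corrupt clean sentences to obtain (noisy source, clean target) pairs, with inflection swaps as one of the error types. The inflection table, the probabilities, and the names `corrupt` and `make_pair` are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import random

# Hypothetical toy inflection table: each set holds surface forms of one lemma.
# A real system would draw these from a morphological lexicon for the language.
INFLECTIONS = [
    {"walk", "walks", "walked", "walking"},
    {"be", "is", "are", "was", "were"},
    {"cat", "cats"},
]

# Map every form to the other forms of the same lemma.
FORM_TO_ALTERNATIVES = {
    form: sorted(group - {form}) for group in INFLECTIONS for form in group
}


def corrupt(tokens, rng, p_morph=0.3, p_drop=0.05, p_swap=0.05):
    """Return a noisy copy of `tokens` (assumed probabilities, not tuned)."""
    out = []
    for tok in tokens:
        alts = FORM_TO_ALTERNATIVES.get(tok.lower())
        if alts and rng.random() < p_morph:
            # Morphological error: replace with another form of the same lemma.
            out.append(rng.choice(alts))
        elif rng.random() < p_drop:
            # Deletion error: skip the token.
            continue
        else:
            out.append(tok)

    # Word-order error: occasionally swap two adjacent tokens.
    i = 0
    while i < len(out) - 1:
        if rng.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out


def make_pair(sentence, seed=0):
    """Build one (noisy source, clean target) training pair."""
    rng = random.Random(seed)
    noisy = corrupt(sentence.split(), rng)
    return " ".join(noisy), sentence


if __name__ == "__main__":
    print(make_pair("the cats are walking", seed=1))
```

Pairs produced this way would serve as pre-training data, with the smaller human-annotated set reserved for fine-tuning, as the abstract describes.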
