A Simple Recipe for Multilingual Grammatical Error Correction
Sascha Rothe, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn
Code
- github.com/google-research-datasets/clang8 (official)
- github.com/gotutiyan/gec-t5 (PyTorch)
Abstract
This paper presents a simple recipe for training state-of-the-art multilingual Grammatical Error Correction (GEC) models. We achieve this by first proposing a language-agnostic method to generate a large number of synthetic examples. The second ingredient is to use large-scale multilingual language models (up to 11B parameters). Once fine-tuned on language-specific supervised sets, we surpass the previous state-of-the-art results on GEC benchmarks in four languages: English, Czech, German, and Russian. Having established a new set of baselines for GEC, we make our results easily reproducible and accessible by releasing the cLang-8 dataset, produced by using our best model, which we call gT5, to clean the targets of the widely used yet noisy Lang-8 dataset. cLang-8 greatly simplifies typical GEC training pipelines, which are composed of multiple fine-tuning stages; we demonstrate that a single fine-tuning step on cLang-8 with off-the-shelf language models yields further accuracy improvements over the already top-performing gT5 model for English.
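The single-stage recipe the abstract describes can be sketched with the Hugging Face stack. This is an illustrative approximation, not the authors' training code (the paper fine-tunes T5/mT5 checkpoints up to xxl in its own framework); the file name `clang8_en.tsv`, the source/target column layout, and all hyperparameters below are assumptions made for the sketch.

```python
# Hypothetical sketch: single-stage fine-tuning of an off-the-shelf T5
# checkpoint on cLang-8 source/target pairs, using transformers + datasets.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "t5-base"  # the paper scales to xxl (11B); base is for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Assumed export of cLang-8: TSV with an ungrammatical source and a cleaned target.
raw = load_dataset("csv", data_files="clang8_en.tsv", delimiter="\t",
                   column_names=["source", "target"])["train"]

def preprocess(batch):
    # Tokenize sources as inputs and cleaned targets as labels.
    inputs = tokenizer(batch["source"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

train = raw.map(preprocess, batched=True, remove_columns=["source", "target"])

args = Seq2SeqTrainingArguments(
    output_dir="gec-t5",
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    num_train_epochs=1,  # a single fine-tuning pass, per the abstract
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Inference: correct a sentence with the fine-tuned model.
ids = tokenizer("She go to school yesterday.", return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(ids, max_length=128)[0],
                       skip_special_tokens=True))
```

The paper's reported numbers rely on scaling the checkpoint; the base model here only demonstrates the pipeline shape.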
Tasks
- Grammatical Error Correction
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| CoNLL-2014 Shared Task | T5 | F0.5 | 68.87 | — | Unverified |
| Falko-MERLIN | gT5 xxl | F0.5 | 75.96 | — | Unverified |
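The F0.5 metric in the table weights precision twice as heavily as recall, as is standard in GEC evaluation (e.g. the M2 scorer used for CoNLL-2014). A minimal sketch of the computation from edit-level counts, with made-up numbers for illustration:

```python
# F_beta over edit counts: beta = 0.5 favors precision over recall.
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    p = tp / (tp + fp) if tp + fp else 0.0  # precision over proposed edits
    r = tp / (tp + fn) if tp + fn else 0.0  # recall over gold edits
    if p == 0.0 and r == 0.0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Example: 70 correct edits, 25 spurious edits, 55 missed edits.
print(round(f_beta(70, 25, 55), 4))  # ~0.69
```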