Impact of Corpora Quality on Neural Machine Translation
2018-10-19
Matīss Rikters
Code
- github.com/M4t1ss/parallel-corpora-tools (official, referenced in paper)
Abstract
Large parallel corpora that are automatically obtained from the web, documents or elsewhere often contain many corrupted parts that negatively affect the quality of the systems and models trained on them. This paper describes frequent problems found in such data and how they affect neural machine translation systems, as well as how to identify and deal with them. The solutions are summarised in a set of scripts that remove problematic sentences from input corpora.
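The filtering idea described in the abstract can be sketched with a few common corpus-cleaning heuristics: dropping sentence pairs that are empty, overlong, identical on both sides, or have an extreme source/target length ratio. This is an illustrative sketch, not the paper's actual scripts; the function name and the thresholds (`max_len`, `max_ratio`) are assumptions for illustration.

```python
# Illustrative parallel-corpus filter (not the paper's actual tool).
# Drops sentence pairs that are empty, too long, have an extreme
# source/target length ratio, or are untranslated copies.

def clean_parallel(pairs, max_len=100, max_ratio=3.0):
    """Return the pairs that pass simple corpus-quality heuristics."""
    kept = []
    for src, tgt in pairs:
        src_toks, tgt_toks = src.split(), tgt.split()
        if not src_toks or not tgt_toks:
            continue  # one side is empty
        if len(src_toks) > max_len or len(tgt_toks) > max_len:
            continue  # overlong sentence, likely boilerplate or markup
        ratio = len(src_toks) / len(tgt_toks)
        if ratio > max_ratio or ratio < 1.0 / max_ratio:
            continue  # extreme length ratio suggests misalignment
        if src.strip() == tgt.strip():
            continue  # source copied verbatim into the target
        kept.append((src, tgt))
    return kept
```

In practice such heuristics are applied to both sides of the corpus jointly so that source and target files stay line-aligned after filtering.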
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| WMT 2017 English-Latvian | Transformer trained on highly filtered data | BLEU | 22.89 | — | Unverified |
| WMT 2017 Latvian-English | Transformer trained on highly filtered data | BLEU | 24.37 | — | Unverified |
| WMT 2018 English-Finnish | Transformer trained on highly filtered data | BLEU | 17.4 | — | Unverified |
| WMT 2018 Finnish-English | Transformer trained on highly filtered data | BLEU | 24 | — | Unverified |