Impact of Corpora Quality on Neural Machine Translation

2018-10-19 · Code available

Matīss Rikters

Abstract

Large parallel corpora that are automatically obtained from the web, documents or elsewhere often exhibit many corrupted parts that are bound to negatively affect the quality of the systems and models that learn from these corpora. This paper describes frequent problems found in such data, how they affect neural machine translation systems, and how to identify and deal with them. The solutions are summarised in a set of scripts that remove problematic sentences from input corpora.
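To make the idea of removing problematic sentences concrete, here is a minimal sketch of a parallel-corpus filter. It is not the paper's scripts: the function name `filter_parallel`, the thresholds `max_ratio` and `max_tokens`, and the four heuristics (empty side, untranslated copy, overlong or badly mismatched lengths, exact duplicates) are illustrative assumptions of the kind commonly used for this task.

```python
from typing import Iterable, Iterator, Tuple


def filter_parallel(
    pairs: Iterable[Tuple[str, str]],
    max_ratio: float = 3.0,   # assumed threshold, not taken from the paper
    max_tokens: int = 100,    # assumed threshold, not taken from the paper
) -> Iterator[Tuple[str, str]]:
    """Yield only sentence pairs that pass a few simple sanity checks."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Drop pairs where the "translation" is an identical copy of the source.
        if src == tgt:
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop overly long sentences.
        if n_src > max_tokens or n_tgt > max_tokens:
            continue
        # Drop pairs whose whitespace-token counts differ too much.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Drop exact duplicate pairs, keeping the first occurrence.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        yield src, tgt
```

Calling `list(filter_parallel(zip(source_lines, target_lines)))` on two aligned lists of sentences would return the surviving pairs; a real pipeline would likely add language-specific checks on top of these.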

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| WMT 2017 English-Latvian | Transformer trained on highly filtered data | BLEU | 22.89 | - | Unverified |
| WMT 2017 Latvian-English | Transformer trained on highly filtered data | BLEU | 24.37 | - | Unverified |
| WMT 2018 English-Finnish | Transformer trained on highly filtered data | BLEU | 17.4 | - | Unverified |
| WMT 2018 Finnish-English | Transformer trained on highly filtered data | BLEU | 24 | - | Unverified |

Reproductions