The RWTH Aachen University Filtering System for the WMT 2018 Parallel Corpus Filtering Task

2018-10-01 · WS 2018

Nick Rossenbach, Jan Rosendahl, Yunsu Kim, Miguel Graça, Aman Gokrani, Hermann Ney

Abstract

This paper describes the submission of RWTH Aachen University for the De→En parallel corpus filtering task of the EMNLP 2018 Third Conference on Machine Translation (WMT 2018). We use several rule-based, heuristic methods to preselect sentence pairs. These sentence pairs are then scored with count-based and neural systems serving as language and translation models. In addition to single sentence-pair scoring, we implement a simple redundancy-removing heuristic. Our best-performing corpus filtering system relies on recurrent neural language models and translation models based on the transformer architecture. A model trained on 10M randomly sampled tokens reaches a performance of 9.2% BLEU on newstest2018. Using our filtering and ranking techniques, we achieve 34.8% BLEU.
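The three stages named in the abstract (rule-based preselection, model-based scoring, redundancy removal) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the specific rules in `passes_rules`, the thresholds `max_len` and `max_ratio`, the target-side duplicate check, and the target-side token budget are all assumptions made for illustration, and `score` is a hypothetical placeholder for whatever language-model or translation-model score is used.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def passes_rules(src: str, tgt: str, max_len: int = 80, max_ratio: float = 3.0) -> bool:
    """Illustrative rule-based preselection (thresholds are assumptions)."""
    s, t = src.split(), tgt.split()
    if not s or not t:
        return False  # drop pairs with an empty side
    if len(s) > max_len or len(t) > max_len:
        return False  # drop overly long sentences
    if max(len(s), len(t)) / min(len(s), len(t)) > max_ratio:
        return False  # drop pairs with an implausible length ratio
    return src.strip() != tgt.strip()  # drop untranslated copies


def filter_corpus(
    pairs: Iterable[Pair],
    score: Callable[[str, str], float],  # hypothetical model score, higher is better
    token_budget: int,
) -> List[Pair]:
    """Preselect, rank by score, skip duplicates, stop at the token budget."""
    candidates = [p for p in pairs if passes_rules(*p)]
    ranked = sorted(candidates, key=lambda p: score(*p), reverse=True)

    seen = set()
    selected: List[Pair] = []
    used = 0
    for src, tgt in ranked:
        key = tgt.lower().strip()  # assumed redundancy check: normalized target side
        if key in seen:
            continue
        n_tokens = len(tgt.split())  # assumption: budget counted on the target side
        if used + n_tokens > token_budget:
            break
        seen.add(key)
        selected.append((src, tgt))
        used += n_tokens
    return selected
```

In this sketch, any callable that maps a sentence pair to a number can be passed as `score`, so a count-based model and a neural model can be swapped in without changing the selection loop.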
