An Unsupervised System for Parallel Corpus Filtering

2018-10-01 · WS 2018

Viktor Hangya, Alex Fraser


Abstract

In this paper, we describe LMU Munich's submission to the WMT 2018 Parallel Corpus Filtering shared task, which addresses the problem of cleaning noisy parallel corpora. Mining and cleaning parallel sentences is important for improving the quality of machine translation systems, especially for low-resource languages. We tackle this problem in a fully unsupervised fashion, relying on bilingual word embeddings created without any bilingual signal. After pre-filtering noisy data, we rank sentence pairs by their bilingual sentence-level similarity and then remove redundant data using monolingual similarity. Our unsupervised system performed well in the official evaluation of the shared task, scoring only a few BLEU points behind the best systems while requiring no parallel training data.
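The ranking and redundancy-removal steps described in the abstract can be illustrated with a minimal sketch. The abstract does not say how sentence-level similarity is computed, so this sketch assumes averaged word embeddings compared by cosine similarity, which is a simple stand-in and not necessarily the authors' method. The names `sentence_vector`, `rank_pairs`, `drop_redundant`, the `threshold` value, and the two-dimensional toy embeddings are all hypothetical, and the pre-filtering step is omitted.

```python
import math

def sentence_vector(tokens, embeddings):
    """Average the embeddings of known tokens; return None if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_pairs(pairs, src_emb, tgt_emb):
    """Score each (source, target) pair in a shared embedding space, best first."""
    scored = []
    for src, tgt in pairs:
        u = sentence_vector(src.lower().split(), src_emb)
        v = sentence_vector(tgt.lower().split(), tgt_emb)
        # Pairs with no known words on one side get the lowest possible score.
        score = cosine(u, v) if u is not None and v is not None else -1.0
        scored.append((score, src, tgt))
    return sorted(scored, key=lambda item: item[0], reverse=True)

def drop_redundant(ranked, tgt_emb, threshold=0.99):
    """Greedily skip pairs whose target side is near-identical to a kept one.

    The threshold is an assumed value chosen for illustration.
    """
    kept, kept_vectors = [], []
    for score, src, tgt in ranked:
        v = sentence_vector(tgt.lower().split(), tgt_emb)
        if v is not None and any(cosine(v, k) >= threshold for k in kept_vectors):
            continue
        kept.append((score, src, tgt))
        if v is not None:
            kept_vectors.append(v)
    return kept

# Toy bilingual embeddings that already share one vector space (assumption).
src_emb = {"hund": [1.0, 0.0], "katze": [0.0, 1.0]}
tgt_emb = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}
pairs = [("Hund", "cat"), ("Hund", "dog"), ("Katze", "cat"), ("Hund Hund", "dog dog")]

ranked = rank_pairs(pairs, src_emb, tgt_emb)
kept = drop_redundant(ranked, tgt_emb)
```

In this toy run the mismatched pair ("Hund", "cat") is ranked last, and the greedy pass keeps only one pair per distinct target-side meaning. A real system would use embeddings trained and mapped without bilingual supervision, as the abstract states, and far larger vocabularies.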

Tasks

Domain Adaptation, Language Modeling, Machine Translation, Sentence, Sentence Embedding, Text Simplification, Translation, Word Embeddings
