Prompsit's submission to WMT 2018 Parallel Corpus Filtering shared task

2018-10-01 · WS 2018

Víctor M. Sánchez-Cartagena, Marta Bañón, Sergio Ortiz-Rojas, Gema Ramírez

Code

github.com/bitextor/bicleaner
Official · In paper · ★ 160

Abstract

This paper describes Prompsit Language Engineering's submissions to the WMT 2018 parallel corpus filtering shared task. Our four submissions were based on an automatic classifier that identifies pairs of sentences that are mutual translations. Before the classifier, a set of hand-crafted hard rules was applied to discard sentences with evident flaws. We explored different strategies for obtaining a training corpus with diverse vocabulary and fluent sentences: language model scoring, an active-learning-inspired data selection algorithm, and n-gram saturation. Our submissions were very competitive in comparison with those of other participants on the 100 million word training corpus.
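
The abstract names two filtering steps, hard rules and n-gram saturation, without specifying them. The sketch below is one plausible reading, not the rule set or algorithm from the paper or from bicleaner: the function names, the length-ratio threshold, and the choice to count n-grams on the target side only are all assumptions made for illustration. It assumes the input pairs are already ordered from best to worst by some score.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]
NGram = Tuple[str, ...]


def passes_hard_rules(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    """Illustrative hard rules (assumed, not the paper's actual rule set)."""
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:
        return False  # one side is empty
    if src == tgt:
        return False  # identical sides, likely an untranslated copy
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
        return False  # token counts differ too much
    return True


def ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """Return the set of n-grams in a token list (empty if too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def saturation_filter(pairs: Iterable[Pair], n: int = 2) -> List[Pair]:
    """Keep a pair only if its target side adds at least one unseen n-gram.

    Assumes `pairs` is sorted from best to worst, so that redundant
    lower-ranked pairs are the ones dropped. Targets shorter than n
    tokens contribute no n-grams and are therefore dropped as well.
    """
    seen: Set[NGram] = set()
    kept: List[Pair] = []
    for src, tgt in pairs:
        grams = ngrams(tgt.lower().split(), n)
        if grams - seen:
            kept.append((src, tgt))
            seen |= grams
    return kept


def filter_corpus(pairs: Iterable[Pair]) -> List[Pair]:
    """Apply the hard rules first, then n-gram saturation."""
    clean = [(s, t) for s, t in pairs if passes_hard_rules(s, t)]
    return saturation_filter(clean)
```

In a fuller pipeline, a translation classifier would score the pairs that survive the hard rules, and that score would determine the order in which saturation sees them; the classifier is left out here because the abstract gives no detail about its features or model.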

Tasks

Active Learning, Language Modelling, Machine Translation
