A hybrid pipeline of rules and machine learning to filter web-crawled parallel corpora
2018-10-01 · WS 2018
Eduard Barbu, Verginica Barbu Mititelu
Abstract
A hybrid pipeline combining rules and machine learning is used to filter a noisy, web-crawled English-German parallel corpus for the Parallel Corpus Filtering task. The core of the pipeline is a logistic regression module that returns the probability that a translation unit should be accepted. The training set for the logistic regression is created by automatic annotation, and the quality of that annotation is estimated by manually labeling the training set.
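The rules-then-classifier idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The rule (`passes_rules`), the two features (length ratio and token overlap), the toy training pairs, their labels, and the 0.5 acceptance threshold are all assumptions chosen for illustration. The logistic regression is written in plain Python so the example stays self-contained.

```python
import math

def passes_rules(source, target):
    """Hypothetical rule stage: reject units with an empty side."""
    return bool(source.strip()) and bool(target.strip())

def features(source, target):
    """Hypothetical features for one translation unit (not from the paper)."""
    src_len, tgt_len = len(source), len(target)
    length_ratio = min(src_len, tgt_len) / max(src_len, tgt_len, 1)
    src_tokens = set(source.lower().split())
    tgt_tokens = set(target.lower().split())
    # High overlap suggests the target is an untranslated copy of the source.
    overlap = len(src_tokens & tgt_tokens) / max(len(src_tokens | tgt_tokens), 1)
    return [length_ratio, overlap]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, x):
    """Probability that a unit with feature vector x is accepted."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train(samples, labels, lr=0.5, epochs=2000):
    """Fit logistic regression by stochastic gradient descent on log loss."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = predict_proba(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def accept(source, target, weights, bias, threshold=0.5):
    """Rules first, then the classifier; the threshold is an assumption."""
    if not passes_rules(source, target):
        return False
    return predict_proba(weights, bias, features(source, target)) >= threshold

# Toy pairs with hand-written labels standing in for automatic annotation
# (1 = accept, 0 = reject). These are illustrative, not data from the paper.
TRAIN_PAIRS = [
    ("The house is small.", "Das Haus ist klein."),
    ("I like green apples.", "Ich mag grüne Äpfel."),
    ("Click here", "Click here"),
    ("Home", "Willkommen auf unserer Website mit allen Angeboten und Neuigkeiten."),
]
LABELS = [1, 1, 0, 0]

weights, bias = train([features(s, t) for s, t in TRAIN_PAIRS], LABELS)

for s, t in TRAIN_PAIRS:
    print(accept(s, t, weights, bias), "|", s, "|", t)
```

Running the rule stage before the classifier keeps cheap, unambiguous rejections out of the model, and returning a probability rather than a hard label lets the threshold be tuned to the amount of data one wants to keep.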