
A hybrid pipeline of rules and machine learning to filter web-crawled parallel corpora

2018-10-01 · WS 2018

Eduard Barbu, Verginica Barbu Mititelu


Abstract

A hybrid pipeline comprising rules and machine learning is used to filter a noisy web English-German parallel corpus for the Parallel Corpus Filtering task. The core of the pipeline is a module based on the logistic regression algorithm that returns the probability that a translation unit is accepted. The training set for the logistic regression is created by automatic annotation. The quality of the automatic annotation is estimated by manually labeling the training set.
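The abstract does not list the rules or the features, so the following is only a minimal sketch of the general idea: a rule stage that discards obviously bad translation units, followed by a logistic regression that returns an acceptance probability. The rules in `rule_filter`, the features in `features`, the toy training pairs, and all function names are assumptions made for illustration; they are not taken from the paper. The hand-written labels stand in for the automatically annotated training set.

```python
import math
import re

def rule_filter(source, target):
    """Hypothetical rule stage: reject empty or identical segments."""
    s, t = source.strip(), target.strip()
    if not s or not t:
        return False
    return s.lower() != t.lower()

def features(source, target):
    """Hypothetical features: bias, character-length ratio, number agreement."""
    ls, lt = len(source), len(target)
    ratio = min(ls, lt) / max(ls, lt)
    same_numbers = sorted(re.findall(r"\d+", source)) == sorted(re.findall(r"\d+", target))
    return [1.0, ratio, 1.0 if same_numbers else 0.0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def train(X, y, lr=0.5, epochs=20000):
    """Plain batch gradient descent on the logistic loss."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = score(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def accept_probability(w, source, target):
    """Rules first; surviving units get a logistic-regression probability."""
    if not rule_filter(source, target):
        return 0.0
    return score(w, features(source, target))

# Toy English-German pairs with illustrative labels (1 = accept, 0 = reject).
TRAINING = [
    ("The house is small.", "Das Haus ist klein.", 1),
    ("We met in 2018.", "Wir trafen uns 2018.", 1),
    ("Click here", "Jetzt 10 Euro sparen und Newsletter abonnieren", 0),
    ("Price: 25 EUR", "Preis: 99 EUR", 0),
    ("Home", "Startseite | Kontakt | Impressum | Datenschutz", 0),
]

weights = train([features(s, t) for s, t, _ in TRAINING],
                [label for _, _, label in TRAINING])
```

Calling `accept_probability(weights, source, target)` on a new pair gives a value between 0 and 1 that can be thresholded or used to rank translation units. A real system would need far richer features and a much larger training set than this sketch.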
