Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora

2017-09-01, EMNLP 2017

Hainan Xu, Philipp Koehn

Abstract

We introduce Zipporah, a fast and scalable data cleaning system. We propose a novel type of bag-of-words translation feature, and train logistic regression models to distinguish good data from synthetic noisy data in the proposed feature space. The trained model is then used to score the parallel sentences in the data pool for selection. As shown in experiments, Zipporah selects a high-quality parallel corpus from a large, mixed-quality data pool. In particular, for one noisy dataset, Zipporah achieves a 2.1 BLEU score improvement using 1/5 of the data, compared with using the entire corpus.
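The pipeline described in the abstract (extract features from each sentence pair, train a logistic regression model on good pairs versus synthetic noisy pairs, then score and rank the pool) can be illustrated with a small sketch. This is not the authors' implementation: the toy dictionary `TOY_DICT`, the two coverage features, the rotation used to create noisy pairs, and all function names are assumptions chosen for illustration, and the features are a simplified stand-in for the paper's bag-of-words translation feature.

```python
import math

# Hypothetical toy dictionary (source word -> acceptable target words).
# A real system would use a learned translation table instead.
TOY_DICT = {
    "the": {"le", "la"}, "cat": {"chat"}, "dog": {"chien"},
    "eats": {"mange"}, "sleeps": {"dort"},
    "house": {"maison"}, "red": {"rouge"},
}

def features(src, tgt):
    """Two illustrative bag-of-words features (an assumption, not the
    paper's definition): dictionary coverage in each direction."""
    s, t = src.lower().split(), tgt.lower().split()
    t_bag = set(t)
    licensed = set()  # target words licensed by the source bag
    for w in s:
        licensed |= TOY_DICT.get(w, set())
    src_cov = sum(1 for w in s if TOY_DICT.get(w, set()) & t_bag) / max(len(s), 1)
    tgt_cov = sum(1 for w in t if w in licensed) / max(len(t), 1)
    return [src_cov, tgt_cov]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(model, x):
    """Probability that a feature vector comes from a good pair."""
    w, b = model
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def train(X, y, lr=0.5, epochs=2000):
    """Plain stochastic gradient descent on the logistic loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            g = score((w, b), xi) - yi  # gradient of the loss w.r.t. the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

# Clean pairs, assumed to be known-good parallel sentences.
good = [
    ("the cat eats", "le chat mange"),
    ("the dog sleeps", "le chien dort"),
    ("the red house", "la maison rouge"),
    ("the cat sleeps", "le chat dort"),
]
# Synthetic noise (assumption): pair each source with the next pair's target.
noisy = [(good[i][0], good[(i + 1) % len(good)][1]) for i in range(len(good))]

X = [features(s, t) for s, t in good + noisy]
y = [1] * len(good) + [0] * len(noisy)
model = train(X, y)

def rank_pool(model, pool):
    """Score every pair in the pool and sort from best to worst."""
    scored = [(score(model, features(s, t)), s, t) for s, t in pool]
    return sorted(scored, key=lambda r: r[0], reverse=True)

pool = [("the red cat", "la maison dort"), ("the dog eats", "le chien mange")]
ranked = rank_pool(model, pool)
for p, s, t in ranked:
    print(f"{p:.3f}\t{s}\t{t}")
```

Selecting data then amounts to keeping the top-ranked fraction of `ranked`. How large that fraction should be is a tuning choice; the abstract reports a result obtained with 1/5 of one noisy dataset.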
