ParaCrawl: Web-Scale Acquisition of Parallel Corpora

2020-07-01 · ACL 2020

Marta Bañón, Pin-zhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, Jaume Zaragoza

Code

github.com/bitextor/bitextor

Abstract

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.

Tasks

Machine Translation Parallel Corpus Mining Sentence Translation
