Efficient Extraction of Pseudo-Parallel Sentences from Raw Monolingual Data Using Word Embeddings

2017-07-01 · ACL 2017

Benjamin Marie, Atsushi Fujita

Abstract

We propose a new method for extracting pseudo-parallel sentences from a pair of large monolingual corpora, without relying on any document-level information. Our method first exploits word embeddings in order to efficiently evaluate trillions of candidate sentence pairs and then a classifier to find the most reliable ones. We report significant improvements in domain adaptation for statistical machine translation when using a translation model trained on the sentence pairs extracted from in-domain monolingual corpora.
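The two-stage idea in the abstract (score candidate pairs cheaply with word embeddings, then keep only the most reliable ones) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how sentences are represented or scored, so the sketch assumes word embeddings that live in a shared bilingual space, averages them into sentence vectors, and ranks pairs by cosine similarity. The toy vectors in `EMB`, the function names, and the fixed `threshold` that stands in for the classifier are all hypothetical.

```python
import math

# Toy word embeddings in a shared bilingual space (hypothetical values,
# chosen only so that translations lie close to each other).
EMB = {
    "cat": [1.0, 0.0, 0.0],
    "chat": [0.9, 0.1, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "chien": [0.1, 0.9, 0.0],
    "sleeps": [0.0, 0.0, 1.0],
    "dort": [0.0, 0.1, 0.9],
}


def sentence_vector(sentence, emb):
    """Average the embeddings of known words; return None if none is known."""
    vecs = [emb[w] for w in sentence.lower().split() if w in emb]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extract_candidates(src_sents, tgt_sents, emb, top_k=1, threshold=0.9):
    """Return (source, target, score) triples for the best-scoring pairs.

    Stage 1: rank every target sentence for each source sentence by the
    cosine similarity of averaged embeddings and keep the top_k.
    Stage 2: a fixed threshold stands in for the classifier (an assumption).
    """
    tgt_vecs = [(t, sentence_vector(t, emb)) for t in tgt_sents]
    pairs = []
    for s in src_sents:
        s_vec = sentence_vector(s, emb)
        if s_vec is None:
            continue  # no known words, nothing to compare
        scored = [(cosine(s_vec, t_vec), t)
                  for t, t_vec in tgt_vecs if t_vec is not None]
        scored.sort(key=lambda item: item[0], reverse=True)
        for score, t in scored[:top_k]:
            if score >= threshold:
                pairs.append((s, t, score))
    return pairs


src = ["cat sleeps", "dog sleeps"]
tgt = ["chien dort", "chat dort"]
pairs = extract_candidates(src, tgt, EMB)
for s, t, score in pairs:
    print(f"{s} <-> {t} ({score:.3f})")
```

The sketch compares every source sentence with every target sentence, which is quadratic and would not scale to the trillions of candidate pairs mentioned in the abstract. A realistic system would need some form of approximate nearest-neighbour search in place of the exhaustive loop, and a trained classifier in place of the threshold.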

Tasks

Domain Adaptation · Information Retrieval · Machine Translation · Sentence Translation · Word Embeddings
