Constructing a Chinese-Japanese Parallel Corpus from Wikipedia

2014-05-01 · LREC 2014

Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi

Abstract

Parallel corpora are crucial for statistical machine translation (SMT). However, they are scarce for most language pairs, including Chinese-Japanese. Because comparable corpora are far more widely available, many studies have tried to construct parallel corpora from them automatically. This paper presents a robust parallel sentence extraction system for constructing a Chinese-Japanese parallel corpus from Wikipedia. The system follows the design of previous studies, which mainly consists of a parallel sentence candidate filter and a binary classifier for parallel sentence identification. We improve the system by using common Chinese characters for filtering and two novel feature sets for classification. Experiments show that our system performs significantly better than previous studies in both parallel sentence extraction accuracy and SMT performance. Using the system, we construct a Chinese-Japanese parallel corpus of more than 126k highly accurate parallel sentences from Wikipedia. The constructed parallel corpus is freely available at http://orchid.kuee.kyoto-u.ac.jp/ chu/resource/wiki_zh_ja.tgz.
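
The candidate-filtering idea in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: it assumes a simple overlap ratio over Han characters, and the function names, the `threshold` value, and the optional `mapping` table (for normalizing variant forms such as Simplified Chinese to Japanese Kanji) are all hypothetical choices made for illustration.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block is considered;
# kana, Latin letters, and punctuation are ignored.
HAN = re.compile(r"[\u4e00-\u9fff]")


def han_chars(sentence, mapping=None):
    """Return the set of Han characters in a sentence, optionally normalized."""
    mapping = mapping or {}
    return {mapping.get(ch, ch) for ch in HAN.findall(sentence)}


def overlap_ratio(zh, ja, mapping=None):
    """Share of Han characters common to both sentences, relative to the smaller set."""
    zh_set, ja_set = han_chars(zh, mapping), han_chars(ja, mapping)
    if not zh_set or not ja_set:
        return 0.0
    return len(zh_set & ja_set) / min(len(zh_set), len(ja_set))


def filter_candidates(pairs, threshold=0.3, mapping=None):
    """Keep (zh, ja) pairs whose overlap ratio reaches the (hypothetical) threshold."""
    return [(zh, ja) for zh, ja in pairs
            if overlap_ratio(zh, ja, mapping) >= threshold]


if __name__ == "__main__":
    pairs = [
        ("我喜欢日本文化", "私は日本文化が好きです"),  # shares 日, 本, 文, 化
        ("今天天气很好", "彼は学生です"),              # shares no Han characters
    ]
    for zh, ja in filter_candidates(pairs):
        print(zh, "|", ja)
```

In a pipeline like the one the abstract describes, pairs that survive such a filter would then be passed to the binary classifier; the classifier and its feature sets are not sketched here because the abstract does not specify them.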

Tasks

Machine Translation Sentence Translation
