Enlarging Scarce In-domain English-Croatian Corpus for SMT of MOOCs Using Serbian

2016-12-01 · WS 2016

Maja Popović, Kostadin Cholakov, Valia Kordoni, Nikola Ljubešić


Abstract

Massive Open Online Courses have been growing rapidly in size and impact. Yet the language barrier constitutes a major impediment to reaching out to all people and educating all citizens. The vast majority of educational material is available only in English, and state-of-the-art machine translation systems have still not been tailored to this peculiar genre. In addition, merely collecting appropriate in-domain training material is a challenging task. In this work, we investigate statistical machine translation of lecture subtitles from English into Croatian, which is morphologically rich and generally weakly supported, especially for the educational domain. We show that results comparable with those of publicly available systems trained on much larger data can be achieved if a small in-domain training set is used in combination with an additional in-domain corpus originating from the closely related Serbian language.
