EM Corpus: a comparable corpus for a less-resourced language pair Manipuri-English

2021-09-01 · RANLP (BUCC) 2021

Rudali Huidrom, Yves Lepage, Khogendra Khomdram

Abstract

In this paper, we introduce a sentence-level comparable text corpus crawled and created for the less-resourced language pair Manipuri (mni) and English (eng). Our monolingual corpora comprise 1.88 million Manipuri sentences and 1.45 million English sentences, and our parallel corpus comprises 124,975 Manipuri-English sentence pairs. These data were crawled and collected over a year from August 2020 to March 2021 from a local newspaper website called ‘The Sangai Express.’ The resources reported in this paper are made available to help the low-resource language community with MT/NLP tasks.
