Monolingual corpus creation and evaluation of truly low-resource languages from Peru

2020-07-01 · WS 2020

Gina Bustamante, Arturo Oncevay

Abstract

We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages on the web, extracting and processing text from PDF files is relevant in a truly low-resource language scenario. Our corpus creation procedure applies multiple filtering steps and focuses on educational PDF documents. Through an evaluation based on language modelling and character-level perplexity, we determine that our method allows the creation of clean monolingual corpora to support further Natural Language Processing (NLP) tasks in the four languages.
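
To make the evaluation metric concrete, here is a minimal sketch of character-level perplexity. The abstract does not say which language model the authors used, so this example swaps in a character bigram model with add-one smoothing purely for illustration. The function names `train_char_bigram` and `char_perplexity` and the start-of-text marker are hypothetical, not taken from the paper.

```python
import math
from collections import Counter

START = "\x02"  # hypothetical start-of-text marker, not from the paper


def train_char_bigram(text):
    """Count character bigrams and their left-context characters."""
    padded = START + text
    bigrams = Counter(zip(padded, padded[1:]))
    contexts = Counter(padded[:-1])
    vocab = set(padded)
    return bigrams, contexts, vocab


def char_perplexity(model, text):
    """Per-character perplexity of `text` under an add-one smoothed bigram model."""
    if not text:
        raise ValueError("text must contain at least one character")
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot reserved for unseen characters
    padded = START + text
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # Add-one smoothing: unseen bigrams still get non-zero probability.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        log_prob += math.log(p)
    # Perplexity is the exponential of the average negative log-likelihood.
    return math.exp(-log_prob / len(text))
```

Under this sketch, held-out text that resembles the training corpus should score a lower perplexity than noisy or out-of-language text, which is the intuition behind using the metric as a signal of corpus cleanliness.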
