
New language resources for the Pashto language

2012-05-01 · LREC 2012

Djamel Mostefa, Khalid Choukri, Sylvie Brunessaux, Karim Boudahmane


Abstract

This paper reports on the development of new language resources for Pashto, a very low-resource language spoken in Afghanistan and Pakistan. As part of a multilingual data collection project, three large corpora are collected for Pashto. First, a monolingual text corpus of 100 million words is produced. Second, a 100-hour speech database is recorded and manually transcribed. Finally, a bilingual Pashto-French parallel corpus of around 2 million is produced by translating Pashto texts into French. These resources will be used to develop Human Language Technology systems for Pashto, with a special focus on Machine Translation.
