
Creating a massively parallel Bible corpus

2014-05-01 · LREC 2014

Thomas Mayer, Michael Cysouw


Abstract

We present our ongoing effort to create a massively parallel Bible corpus. While an ever-increasing number of Bible translations is available in electronic form on the internet, there is no large-scale parallel Bible corpus that gives language researchers easy access to the texts and their parallel structure for a large variety of languages. We report on the current status of the corpus, with over 900 translations in more than 830 language varieties. All translations are tokenized (e.g., separating punctuation marks) and Unicode-normalized. Mainly due to copyright restrictions, only portions of the texts are made publicly available. However, we provide co-occurrence information for each translation in a (sparse) matrix format. All word forms in the translation are given together with their frequency and the verses in which they occur.
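The preprocessing and co-occurrence information described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the verse IDs (`v1`, `v2`), the sample text, the choice of NFC as the normalization form, the tokenization regex, and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
import re
import unicodedata
from collections import Counter, defaultdict

# Hypothetical toy data: the corpus's real verse ID scheme and file format
# are not described in the abstract.
VERSES = {
    "v1": "Abraham begat Isaac.",
    "v2": "Isaac begat Jacob; and Jacob knew Isaac.",
}

# Assumed tokenization rule: a run of word characters, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Unicode-normalize (NFC is an assumption) and split off punctuation."""
    return TOKEN_RE.findall(unicodedata.normalize("NFC", text))


def build_index(verses):
    """Return word-form frequencies and, per word form, its count in each verse."""
    freq = Counter()
    occurrences = defaultdict(Counter)  # word form -> {verse ID: count}
    for verse_id, text in verses.items():
        for token in tokenize(text):
            freq[token] += 1
            occurrences[token][verse_id] += 1
    return freq, occurrences


def to_sparse_triples(occurrences, verse_ids):
    """Encode the word-by-verse matrix as (row, column, count) triples."""
    words = sorted(occurrences)
    column = {verse_id: j for j, verse_id in enumerate(verse_ids)}
    triples = [
        (i, column[verse_id], count)
        for i, word in enumerate(words)
        for verse_id, count in sorted(occurrences[word].items())
    ]
    return words, triples


freq, occurrences = build_index(VERSES)
words, triples = to_sparse_triples(occurrences, list(VERSES))
```

Only non-zero cells are stored, which is what makes the word-by-verse representation sparse: most word forms occur in very few verses. Because verse IDs are shared across translations, the same column index would refer to the same verse in every translation's matrix, which is the parallel structure the abstract refers to.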
