Toward a Comparable Corpus of Latvian, Russian and English Tweets

2017-08-01 · WS 2017

Dmitrijs Milajevs

Abstract

Twitter has become a rich source of linguistic data. Here, the possibility of building a trilingual Latvian-Russian-English corpus of tweets from Riga, Latvia is investigated. Such a corpus, once constructed, might be of great use for multiple purposes, including training machine translation models, examining cross-lingual phenomena and studying the population of Riga. By collecting and analysing a pilot corpus, this study shows that building such a resource is feasible. The pilot corpus is made publicly available and can be used to construct a large comparable corpus.
