Facilitating Corpus Usage: Making Icelandic Corpora More Accessible for Researchers and Language Users

2020-05-01 · LREC 2020

Steinþór Steingrímsson, Starkaður Barkarson, Gunnar Thor Örnólfsson

Abstract

We introduce an array of open and accessible tools to facilitate the use of the Icelandic Gigaword Corpus in the field of Natural Language Processing, as well as for students, linguists, sociologists and others benefiting from large corpora. A KWIC engine, powered by the Swedish Korp tool, is adapted to the specifics of the corpus. An n-gram viewer, highly customizable to suit different needs, allows users to study word usage throughout the period covered by our text collection. A frequency dictionary provides much sought-after word frequency statistics, computed for each subcorpus as well as in aggregate, disambiguating homographs based on their respective lemmas and morphosyntactic tags. Furthermore, we provide n-grams based on the corpus and a variety of pre-trained word embedding models based on word2vec, GloVe, fastText and ELMo. For three of the model types, multiple word embedding models are available, trained with different algorithms and using either lemmatised or unlemmatised texts.
