Using Word Familiarities and Word Associations to Measure Corpus Representativeness

2014-05-01 · LREC 2014

Reinhard Rapp


Abstract

The definition of corpus representativeness used here assumes that a representative corpus should reflect, as closely as possible, the average language use that a native speaker encounters in everyday life over an extended period of time. As it is not practical to observe people's language input over years, we suggest utilizing two types of experimental data that capture two forms of human intuition: word familiarity norms and word association norms. If it is true that human language acquisition is corpus-based, such data should reflect people's perceived language input. Assuming so, we compute a representativeness score for a corpus by extracting word frequency and word association statistics from it and comparing these statistics to the human data. The higher the similarity, the more representative the corpus should be of the language environments of the test subjects. We present results for five different corpora and for truncated versions thereof. The results confirm the expectation that corpus size and corpus balance are crucial aspects of corpus representativeness.
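The familiarity half of this comparison can be illustrated with a minimal sketch. The abstract does not state which similarity measure is used, so the Spearman rank correlation below is an assumption, and the function names (`rank`, `pearson`, `familiarity_score`) and the toy data are hypothetical. The word association half is not covered here.

```python
import math
from collections import Counter


def rank(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def familiarity_score(tokens, familiarity_norms):
    """Spearman correlation between corpus frequencies and human ratings.

    Assumption: similarity is measured by rank correlation over the words
    that appear in the familiarity norms; the paper may use another measure.
    """
    counts = Counter(t.lower() for t in tokens)
    words = sorted(familiarity_norms)
    freqs = [counts[w] for w in words]          # 0 for words absent from the corpus
    ratings = [familiarity_norms[w] for w in words]
    return pearson(rank(freqs), rank(ratings))


# Toy data, invented for illustration only.
tokens = "the the the the dog dog dog house house zebra".split()
norms = {"the": 7.0, "dog": 6.5, "house": 6.0, "zebra": 3.0}
score = familiarity_score(tokens, norms)
```

Under this assumption, a score near 1 means that the corpus frequency ordering agrees with the human familiarity ordering, and a score near -1 means that it is reversed.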

Tasks

Language Acquisition
