Distributed Distributional Similarities of Google Books Over the Centuries

2014-05-01 · LREC 2014

Martin Riedl, Richard Steuer, Chris Biemann

Abstract

This paper introduces a distributional thesaurus and sense clusters computed on the complete Google Syntactic N-grams, which are extracted from Google Books, a very large corpus of digitized books published between 1520 and 2008. We show that a thesaurus computed on such a large text basis yields much better results than one computed on smaller corpora such as Wikipedia. We also provide distributional thesauri for equal-sized time slices of the corpus. While distributional thesauri can be used as lexical resources in NLP tasks, comparing word similarities over time can reveal sense changes of terms across decades or centuries, and can serve as a resource for diachronic lexicography. Thesauri and clusters are available for download.
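The idea of comparing word similarities across time slices can be illustrated with a small sketch: for a target word, compare its top distributional neighbors in two eras and measure their overlap. Low overlap hints at sense change. The neighbor lists below are invented examples for illustration, not actual entries from the released thesauri.

```python
# Hypothetical sketch of detecting sense change across time slices
# by comparing a word's distributional neighbors. The neighbor lists
# are invented examples, not actual thesaurus output.

def jaccard(a, b):
    """Jaccard overlap between two neighbor sets (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Invented top neighbors of "gay" in two hypothetical time slices
neighbors_1900 = ["cheerful", "merry", "lively", "joyous", "happy"]
neighbors_2000 = ["lesbian", "homosexual", "bisexual", "queer", "merry"]

overlap = jaccard(neighbors_1900, neighbors_2000)
print(f"neighbor overlap: {overlap:.2f}")
```

A low overlap score between slices flags the word as a candidate for diachronic inspection; in practice one would compare ranked neighbor lists from each time-slice thesaurus rather than toy sets.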
