Software Framework for Topic Modelling with Large Corpora

2010-01-01 · Workshop On New Challenges For NLP Frameworks 2010

Radim Řehůřek, Petr Sojka

Code

github.com/RaRe-Technologies/gensim

Abstract

Large corpora are ubiquitous in today’s world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). In this paper, we identify a gap in existing implementations of many of the popular algorithms: their scalability and ease of use. We describe a Natural Language Processing software framework based on the idea of document streaming, i.e. processing corpora one document after another, in a memory-independent fashion. Within this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that interested practitioners can modify, extend, and apply the methods with little effort. We demonstrate the usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library, DML-CZ.
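The document-streaming idea in the abstract can be illustrated with a small sketch. It is not the framework's actual API: `StreamedCorpus`, `build_vocabulary`, `document_frequencies`, and `open_source` are hypothetical names chosen for this example, and the one-document-per-line input format is an assumption. The point is only that each pass holds a single document in memory at a time, so memory use depends on the vocabulary size rather than on the number of documents.

```python
import io
from collections import Counter


def build_vocabulary(open_source):
    """First pass over the stream: assign an integer id to each distinct token."""
    vocabulary = {}
    for line in open_source():
        for token in line.lower().split():
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


class StreamedCorpus:
    """Hypothetical corpus wrapper: yields one sparse bag-of-words vector per document.

    `open_source` is a callable returning a fresh iterator over lines
    (one document per line), so the corpus can be iterated repeatedly
    without ever loading all documents at once.
    """

    def __init__(self, open_source, vocabulary):
        self.open_source = open_source
        self.vocabulary = vocabulary

    def __iter__(self):
        for line in self.open_source():
            counts = Counter(line.lower().split())
            # Sparse vector: sorted (token_id, count) pairs for known tokens only.
            yield sorted(
                (self.vocabulary[token], count)
                for token, count in counts.items()
                if token in self.vocabulary
            )


def document_frequencies(corpus):
    """One streaming pass: count in how many documents each token id occurs."""
    frequencies = Counter()
    for document in corpus:
        frequencies.update(token_id for token_id, _ in document)
    return frequencies


# Toy in-memory source standing in for a large file on disk (assumption).
TEXT = "human machine interface\nmachine learning survey\nhuman survey\n"


def open_source():
    return io.StringIO(TEXT)


vocab = build_vocabulary(open_source)
corpus = StreamedCorpus(open_source, vocab)
df = document_frequencies(corpus)
```

In practice `open_source` would open a file on disk, for example `lambda: open(path, encoding="utf-8")` with a hypothetical `path`, and any algorithm written against the iterator interface would then work on corpora larger than available RAM.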

Tasks

Topic Models
