Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence

2020-04-08 · ACL 2021 · Code Available

Federico Bianchi, Silvia Terragni, Dirk Hovy

Code

github.com/MilaNLProc/contextualized-topic-models
Official · In paper · PyTorch · ★ 1,266
github.com/d2klab/tomodapi
none · ★ 38
github.com/aaronmueller/contextualized-topic-models
PyTorch · ★ 7

Abstract

Topic models extract groups of words from documents, whose interpretation as a topic hopefully allows for a better understanding of the data. However, the resulting word groups are often not coherent, making them harder to interpret. Recently, neural topic models have shown improvements in overall coherence. Concurrently, contextual embeddings have advanced the state of the art of neural models in general. In this paper, we combine contextualized representations with neural topic models. We find that our approach produces more meaningful and coherent topics than traditional bag-of-words topic models and recent neural models. Our results indicate that future improvements in language models will translate into better topic models.
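
To make the idea of combining contextualized representations with a neural topic model concrete, here is a minimal sketch. It assumes the combination is done by concatenating a document's bag-of-words vector with its contextual embedding before the encoder; the abstract itself does not state this. All names (`bag_of_words`, `topic_proportions`), the random stand-in for the contextual embedding, and the single linear layer are hypothetical simplifications, not the authors' implementation, and the sketch omits the variational sampling step entirely.

```python
import math
import random


def bag_of_words(tokens, vocab):
    """Count how often each vocabulary word occurs in the document."""
    return [float(tokens.count(word)) for word in vocab]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def topic_proportions(bow, ctx_embedding, weights, bias):
    """Map the concatenated input to a distribution over topics.

    Assumption: one linear layer stands in for the encoder network.
    """
    x = bow + ctx_embedding  # list concatenation: [BoW ; contextual embedding]
    logits = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


random.seed(0)
vocab = ["topic", "model", "embedding", "neural"]
doc = "neural topic model with neural embedding".split()
bow = bag_of_words(doc, vocab)

# Placeholder for a pre-trained sentence encoder's output (hypothetical values).
ctx = [random.gauss(0, 1) for _ in range(3)]

n_topics = 2
weights = [
    [random.gauss(0, 0.1) for _ in range(len(vocab) + len(ctx))]
    for _ in range(n_topics)
]
bias = [0.0] * n_topics

theta = topic_proportions(bow, ctx, weights, bias)
print(theta)  # one non-negative weight per topic, summing to 1
```

In a real system the placeholder vector would come from a pre-trained language model, which is why a better language model could be expected to yield better topics.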

Tasks

Sentence Embeddings, Topic Models, Variational Inference
