Distributed Representations of Sentences and Documents

2014-05-16

Quoc V. Le, Tomas Mikolov


Code

github.com/bombdiggity/paper-bag
tf★ 1
github.com/jimmy6727/Informd
tf★ 0
github.com/TheCyberian/windowsMalwareDetectionWithNLP
none★ 0
github.com/julian-risch/ICADL2018
tf★ 0
github.com/hithisisdhara/doc2vec
pytorch★ 0
github.com/inejc/paragraph-vectors
pytorch★ 0
github.com/kr900910/supreme_court_opinion
tf★ 0
github.com/fabiocorreacordeiro/Elsevier_abstracts-Classification
none★ 0
github.com/tsandefer/capstone_2
tf★ 0
github.com/DCYN/Ramdomized-Clinical-Trail-Classification
tf★ 0

Abstract

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
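
The claim that "powerful," "strong" and "Paris" are equally distant under bag-of-words can be checked with a minimal sketch. It assumes a toy three-word vocabulary, one-hot word vectors, and Euclidean distance; the helper names `one_hot` and `euclidean` are illustrative and do not come from the paper.

```python
import math

# Toy vocabulary (an assumption for illustration only).
vocab = ["powerful", "strong", "Paris"]

def one_hot(word):
    """Return the one-hot (bag-of-words) vector for a single word."""
    return [1.0 if w == word else 0.0 for w in vocab]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

pairs = [("powerful", "strong"), ("powerful", "Paris"), ("strong", "Paris")]
distances = {
    pair: euclidean(one_hot(pair[0]), one_hot(pair[1])) for pair in pairs
}

for pair, dist in distances.items():
    print(pair, round(dist, 4))
```

Every pair comes out at the same distance (the square root of 2), so the representation carries no signal that "powerful" is closer in meaning to "strong" than to "Paris". A learned dense vector is not bound by this constraint.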

Tasks

Question Answering, Sentiment Analysis, Text Classification

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
QASent	Paragraph vector (lexical overlap + dist output)	MAP	0.68	—	Unverified
QASent	Paragraph vector	MAP	0.52	—	Unverified
WikiQA	Paragraph vector (lexical overlap + dist output)	MAP	0.6	—	Unverified
WikiQA	Paragraph vector	MAP	0.51	—	Unverified
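
The MAP column above is mean average precision. As a rough sketch of how such a score is typically computed for answer ranking, the code below assumes each question is given as a list of 0/1 relevance labels already sorted by the model's ranking. The function names and the handling of questions with no relevant candidate (scored as 0 here) are assumptions, not the evaluation script behind the listed numbers.

```python
def average_precision(ranked_labels):
    """Average precision for one question.

    ranked_labels: 0/1 relevance labels, ordered from the top-ranked
    candidate to the bottom-ranked one.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    # Assumption: a question with no relevant candidate scores 0.
    return precision_sum / hits if hits else 0.0

def mean_average_precision(questions):
    """Mean of the per-question average precision values."""
    return sum(average_precision(q) for q in questions) / len(questions)

# Hypothetical rankings for two questions (made-up labels).
example = [[1, 0, 1], [0, 1]]
score = mean_average_precision(example)
print(round(score, 4))
```

Some evaluation setups drop questions without any correct answer instead of scoring them as 0, which changes the reported value, so scores are only comparable under the same convention.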
