Grammar induction from (lots of) words alone

2016-12-01 · COLING 2016 · Code Available

John K Pate, Mark Johnson

Code

github.com/jkpate/streamingDMV
Official implementation, linked in the paper

Abstract

Grammar induction is the task of learning syntactic structure in a setting where that structure is hidden. Grammar induction from words alone is interesting because it is similar to the problem that a child learning a language faces. Previous work has typically assumed richer but cognitively implausible input, such as POS-tag-annotated data, which makes that work less relevant to human language acquisition. We show that grammar induction from words alone is in fact feasible when the model is provided with sufficient training data, and present two new streaming or minibatch algorithms for PCFG inference that can learn from millions of words of training data. We compare the performance of these algorithms to a batch algorithm that learns from less data. The minibatch algorithms outperform the batch algorithm, showing that cheap inference with more data is better than intensive inference with less data. Additionally, we show that the harmonic initialiser, which previous work identified as essential when learning from small POS-tag-annotated corpora (Klein and Manning, 2004), is not superior to a uniform initialisation.

Tasks

Language Acquisition · POS TAG · Topic Models
