Exploring the Limits of Language Modeling
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu
Code Available
- github.com/tensorflow/models (tf) ★ 77,694
- github.com/rdspring1/PyTorch_GBW_LM (pytorch) ★ 123
- github.com/jmichaelov/does-surprisal-explain-n400 (pytorch) ★ 1
- github.com/UnofficialJuliaMirrorSnapshots/DeepMark-deepmark (torch) ★ 0
- github.com/rafaljozefowicz/lm (tf) ★ 0
- github.com/tensorflow/models/tree/master/research/lm_1b (tf) ★ 0
- github.com/dmlc/gluon-nlp (mxnet) ★ 0
- github.com/okuchaiev/f-lm (tf) ★ 0
- github.com/DeepMark/deepmark (torch) ★ 0
- github.com/UnofficialJuliaMirror/DeepMark-deepmark (torch) ★ 0
Abstract
In this work we explore recent advances in Recurrent Neural Networks for large-scale Language Modeling, a task central to language understanding. We extend current models to deal with two key challenges present in this task: corpora and vocabulary sizes, and the complex, long-term structure of language. We perform an exhaustive study of techniques such as character Convolutional Neural Networks and Long-Short Term Memory on the One Billion Word Benchmark. Our best single model significantly improves the state-of-the-art perplexity from 51.3 down to 30.0 (whilst reducing the number of parameters by a factor of 20), while an ensemble of models sets a new record by improving perplexity from 41.0 down to 23.7. We also release these models for the NLP and ML community to study and improve upon.
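Perplexity, the metric used throughout the abstract and the results below, is the exponentiated average negative log-likelihood the model assigns to each target token; lower is better, and a uniform model over a vocabulary of size V scores exactly V. A minimal sketch (the function name and sample values are illustrative, not from the paper):

```python
import math

def perplexity(log_probs):
    """Compute perplexity from per-token natural-log probabilities.

    log_probs: the model's log-probability of each observed token.
    Perplexity = exp(-mean(log_probs)).
    """
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# A model that is uniform over a 4-token vocabulary has perplexity 4.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))  # ≈ 4.0
```

Under this definition, the single-model improvement from 51.3 to 30.0 means the model's per-token uncertainty shrank from the equivalent of choosing uniformly among ~51 words to choosing among ~30.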
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| One Billion Word | 10 LSTM+CNN inputs + SNM10-SKIP (ensemble) | PPL | 23.7 | — | Unverified |
| One Billion Word | LSTM-8192-1024 + CNN Input | PPL | 30.0 | — | Unverified |
| One Billion Word | LSTM-8192-1024 | PPL | 30.6 | — | Unverified |