Regularizing and Optimizing LSTM Language Models
Stephen Merity, Nitish Shirish Keskar, Richard Socher
Code
- github.com/salesforce/awd-lstm-lm (Official, PyTorch) ★ 0
- github.com/S-Abdelnabi/awt (PyTorch) ★ 54
- github.com/ahmetumutdurmus/awd-lstm (PyTorch) ★ 12
- github.com/jkkummerfeld/emnlp20lm (PyTorch) ★ 3
- github.com/Asteur/RERITES-AvgWeightDescentLSTM-PoetryGeneration (PyTorch) ★ 0
- github.com/alexandra-chron/wassa-2018 (PyTorch) ★ 0
- github.com/llppff/ptb-lstmorqrnn-pytorch (PyTorch) ★ 0
- github.com/mnhng/hier-char-emb (PyTorch) ★ 0
- github.com/BenjiKCF/AWD-LSTM-sentiment-classifier (PyTorch) ★ 0
- github.com/cstorm125/thai2fit (PyTorch) ★ 0
Abstract
Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modeling, and question answering. In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization. Further, we introduce NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user. Using these and other regularization strategies, we achieve state-of-the-art word level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the effectiveness of a neural cache in conjunction with our proposed model, we achieve an even lower state-of-the-art perplexity of 52.8 on Penn Treebank and 52.0 on WikiText-2.
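The weight-dropped LSTM described above applies DropConnect to the recurrent (hidden-to-hidden) weight matrix rather than to activations. A minimal sketch of the idea in plain Python, assuming a weight matrix represented as a list of rows (the function name and the 1/(1-p) inverted-dropout rescaling convention are illustrative, not taken verbatim from the paper's code):

```python
import random

def weight_drop(w_hh, p, rng):
    """DropConnect on a hidden-to-hidden weight matrix (list of rows).

    Individual recurrent weights are zeroed with probability p; the
    resulting masked matrix would be reused for every timestep of the
    forward pass, which is what makes this a recurrent regularizer
    rather than per-step dropout on hidden states. Kept weights are
    rescaled by 1/(1-p) (inverted-dropout convention; an assumption
    here, frameworks differ on where the rescaling happens).
    """
    return [[wij / (1.0 - p) if rng.random() >= p else 0.0 for wij in row]
            for row in w_hh]

rng = random.Random(0)
w = [[1.0] * 4 for _ in range(4)]   # toy 4x4 recurrent weight matrix
w_dropped = weight_drop(w, 0.5, rng)
# entries of w_dropped are now either 0.0 (dropped) or 2.0 (kept, rescaled)
```

Because the mask is applied to the weights once before the sequence is unrolled, it requires no modification to the LSTM cell itself, which is why the technique composes with black-box RNN implementations.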
Tasks
- Language Modelling

Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| Penn Treebank (Word Level) | AWD-LSTM + continuous cache pointer | Test perplexity | 52.8 | — | Unverified |
| Penn Treebank (Word Level) | AWD-LSTM | Test perplexity | 57.3 | — | Unverified |
| WikiText-2 | AWD-LSTM + continuous cache pointer | Test perplexity | 52.0 | — | Unverified |
| WikiText-2 | AWD-LSTM | Test perplexity | 65.8 | — | Unverified |
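The NT-ASGD trigger mentioned in the abstract replaces a user-tuned averaging start point with a non-monotonic condition on validation loss: averaging begins once the model has failed to beat its historical best for several consecutive evaluations. A sketch of that check, assuming losses are recorded once per evaluation (the function name is illustrative; `n` corresponds to the non-monotone interval, for which the paper uses 5):

```python
def should_trigger_averaging(val_loss, prev_losses, n=5):
    """Non-monotonic trigger for switching from SGD to averaged SGD.

    Returns True once the current validation loss is worse than the
    best loss recorded more than n evaluations ago, i.e. the last n
    checks have not improved on the historical best. Until at least
    n+1 previous losses exist there is nothing to compare against.
    """
    if len(prev_losses) <= n:
        return False                      # not enough history yet
    return val_loss > min(prev_losses[:-n])

# toy run: loss improves for a while, then plateaus
history = []
triggered_at = None
losses = [10.0, 8.0, 6.0, 5.0, 5.1, 5.2, 5.1, 5.3, 5.2, 5.4]
for step, loss in enumerate(losses):
    if triggered_at is None and should_trigger_averaging(loss, history, n=5):
        triggered_at = step
    history.append(loss)
# triggered_at == 9: by step 9 the best loss (5.0) is more than n checks old
```

Excluding the most recent `n` losses from the minimum is what makes the condition non-monotonic: short plateaus or small upticks within the window do not trigger averaging prematurely.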