Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen
Code
- github.com/zihangdai/mos (official, in paper; PyTorch, ★ 0)
- github.com/yfreedomliTHU/mos-pytorch1.1 (PyTorch, ★ 3)
- github.com/cstorm125/thai2fit (PyTorch, ★ 0)
- github.com/nunezpaul/MNIST (TensorFlow, ★ 0)
- github.com/zhangyaoyuan/GAN-Simplification (TensorFlow, ★ 0)
- github.com/nkcr/overlap-ml (PyTorch, ★ 0)
- github.com/omerlux/Recurrent_Neural_Network_-_Part_2 (TensorFlow, ★ 0)
- github.com/tdmeeste/SparseSeqModels (PyTorch, ★ 0)
- github.com/omerlux/NLP-PTB (PyTorch, ★ 0)
Abstract
We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. Given that natural language is highly context-dependent, this further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language. We propose a simple and effective method to address this issue, and improve the state-of-the-art perplexities on Penn Treebank and WikiText-2 to 47.69 and 40.68 respectively. The proposed method also excels on the large-scale 1B Word dataset, outperforming the baseline by over 5.6 points in perplexity.
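The bottleneck argument can be illustrated numerically. A single softmax produces a log-probability matrix whose rank is at most the hidden dimension plus one, while the paper's proposed Mixture of Softmaxes (MoS) takes a log of a weighted mixture, which is not a low-rank operation. The sketch below is a minimal numpy illustration under assumed random parameters, not the paper's implementation; the component projections `P_k` and the Dirichlet-sampled mixture weights `pi` stand in for the learned prior and projections.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, V, K = 50, 4, 30, 3  # contexts, hidden dim, vocab size, mixture components

H = rng.standard_normal((N, d))  # context (hidden) vectors
W = rng.standard_normal((V, d))  # output word embeddings

# Single softmax: log P = log_softmax(H W^T).
# rank(H W^T) <= d, and the per-row normalizer adds at most 1, so rank <= d + 1.
logits = H @ W.T
single = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# Mixture of Softmaxes: K component softmaxes mixed with per-context weights.
# (In the paper both are computed from the context; here they are random stand-ins.)
P_k = [rng.standard_normal((d, d)) for _ in range(K)]
pi = rng.dirichlet(np.ones(K), size=N)  # mixture weights, one row per context
probs = np.zeros((N, V))
for k in range(K):
    lk = (H @ P_k[k]) @ W.T
    sk = np.exp(lk - lk.max(axis=1, keepdims=True))
    probs += pi[:, [k]] * (sk / sk.sum(axis=1, keepdims=True))
mos = np.log(probs)  # log of a mixture: no low-rank bound applies

print(np.linalg.matrix_rank(single))  # bounded by d + 1 = 5
print(np.linalg.matrix_rank(mos))     # typically much higher
```

The empirical rank gap is the point: a true natural-language log-probability matrix is argued to be high-rank, so the single-softmax bound of d + 1 caps expressiveness, whereas MoS escapes it without increasing the embedding dimension.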
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| Penn Treebank (Word Level) | AWD-LSTM-MoS + dynamic eval | Test perplexity | 47.69 | — | Unverified |
| Penn Treebank (Word Level) | AWD-LSTM-MoS | Test perplexity | 54.44 | — | Unverified |
| WikiText-2 | AWD-LSTM-MoS + dynamic eval | Test perplexity | 40.68 | — | Unverified |
| WikiText-2 | AWD-LSTM-MoS | Test perplexity | 61.45 | — | Unverified |