Exploring Curvature Noise in Large-Batch Stochastic Optimization

2018-09-27Unverified0· sign in to hype

Yeming Wen, Kevin Luk, Maxime Gazeau, Guodong Zhang, Harris Chan, Jimmy Ba

Unverified — Be the first to reproduce this paper.

Abstract

Using stochastic gradient descent (SGD) with large batch-sizes to train deep neural networks is an increasingly popular technique. By doing so, one can improve parallelization by scaling to multiple workers (GPUs) and hence leading to significant reductions in training time. Unfortunately, a major drawback is the so-called generalization gap: large-batch training typically leads to a degradation in generalization performance of the model as compared to small-batch training. In this paper, we propose to correct this generalization gap by adding diagonal Fisher curvature noise to large-batch gradient updates. We provide a theoretical analysis of our method in the convex quadratic setting. Our empirical study with state-of-the-art deep learning models shows that our method not only improves the generalization performance in large-batch training but furthermore, does so in a way where the training convergence remains desirable and the training duration is not elongated. We additionally connect our method to recent works on loss surface landscape in the experimental section.

Tasks

Stochastic Optimization

Exploring Curvature Noise in Large-Batch Stochastic Optimization

Abstract

Tasks

Reproductions