
The Simpler the Better: Vanilla SGD Revisited

2021-01-01

Yueyao Yu, Jie Wang, Wenye Li, Yin Zhang


Abstract

The stochastic gradient descent (SGD) method, first proposed in the 1950s, has been the foundation for deep-neural-network (DNN) training, with numerous enhancements including adding momentum, adaptively selecting learning rates, or using both strategies and more. Conventional wisdom for SGD holds that the learning rate must eventually be made small in order to reach sufficiently good approximate solutions. Another widely held view is that vanilla SGD is out of fashion compared to many of its modern variations. In this work, we make the contrarian claim that, when training over-parameterized DNNs, vanilla SGD can still compete well with, and oftentimes outperform, its more recent variations simply by using learning rates significantly larger than commonly used values. We provide theoretical justifications for this claim, and also present computational evidence in support of it across multiple tasks including image classification, speech recognition, and natural language processing.
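To make the contrast concrete, the following is a minimal sketch (not the paper's experiments) of the vanilla SGD update the abstract refers to: no momentum, no adaptive step sizes, just `w -= lr * grad` with a constant learning rate. The toy problem, sizes, step count, and the learning rate of 1.0 are illustrative assumptions; the point is that on an over-parameterized, consistent problem a fairly large constant step size can still drive the training loss down.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                        # fewer samples than parameters: over-parameterized
X = rng.normal(size=(n, d)) / np.sqrt(d)   # rows have norm ~1
w_true = rng.normal(size=d)
y = X @ w_true                        # consistent labels (zero noise, an assumption)

w = np.zeros(d)
lr = 1.0                              # deliberately large, constant learning rate
for step in range(500):
    i = rng.integers(n)                       # draw one sample
    grad = (X[i] @ w - y[i]) * X[i]           # gradient of 0.5 * (x_i @ w - y_i)**2
    w -= lr * grad                            # vanilla SGD: no momentum, no adaptivity

loss = 0.5 * np.mean((X @ w - y) ** 2)
print(f"final training loss: {loss:.2e}")
```

With rows of roughly unit norm, a step size of 1.0 makes each update close to a projection onto one data constraint, so the loss on this interpolating problem shrinks despite the learning rate never decaying.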
