
Lookahead Optimizer: k steps forward, 1 step back

2019-07-19 · NeurIPS 2019 · Code Available

Michael R. Zhang, James Lucas, Geoffrey Hinton, Jimmy Ba


Abstract

The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of fast weights generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.
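The abstract's "k steps forward, 1 step back" update can be made concrete with a small sketch: the inner (fast) optimizer runs for k steps starting from the current slow weights, and the slow weights then move a fraction alpha toward the resulting fast weights. The Python below is a minimal illustration on a toy quadratic loss, not the authors' released code; the names sgd_step, quadratic_grad, and the chosen hyperparameter values are assumptions made for the example.

```python
import numpy as np

def quadratic_grad(theta):
    # Gradient of a toy quadratic loss f(theta) = 0.5 * ||theta||^2 (illustrative only).
    return theta

def sgd_step(theta, lr=0.1):
    # One inner-optimizer ("fast weights") update; any base optimizer could be used here.
    return theta - lr * quadratic_grad(theta)

def lookahead(theta0, inner_step, k=5, alpha=0.5, outer_steps=20):
    """Sketch of the Lookahead outer loop: k fast steps, then one slow interpolation."""
    slow = np.array(theta0, dtype=float)     # slow weights (phi)
    for _ in range(outer_steps):
        fast = slow.copy()                   # fast weights start from the slow weights
        for _ in range(k):
            fast = inner_step(fast)          # "k steps forward" with the inner optimizer
        slow = slow + alpha * (fast - slow)  # "1 step back": interpolate toward the fast weights
    return slow

if __name__ == "__main__":
    print(lookahead(np.ones(3), sgd_step))   # moves toward the minimum at zero
```

Because the slow weights are updated only once every k inner steps, the overhead relative to the inner optimizer is a single interpolation and one extra stored copy of the parameters, which is consistent with the abstract's claim of negligible computation and memory cost.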

Tasks

Benchmark Results

Dataset                           | Model     | Metric         | Claimed | Verified | Status
CIFAR-10 ResNet-18 - 200 Epochs   | Adam      | Accuracy       | 94.84   | -        | Unverified
CIFAR-10 ResNet-18 - 200 Epochs   | Lookahead | Accuracy       | 95.27   | -        | Unverified
CIFAR-10 ResNet-18 - 200 Epochs   | SGD       | Accuracy       | 95.23   | -        | Unverified
ImageNet ResNet-50 - 50 Epochs    | Lookahead | Top-1 Accuracy | 75.13   | -        | Unverified
ImageNet ResNet-50 - 50 Epochs    | SGD       | Top-5 Accuracy | 92.15   | -        | Unverified
ImageNet ResNet-50 - 60 Epochs    | Lookahead | Top-1 Accuracy | 75.49   | -        | Unverified
ImageNet ResNet-50 - 60 Epochs    | SGD       | Top-1 Accuracy | 75.15   | -        | Unverified

Reproductions