ADOPT: Modified Adam Can Converge with Any β_2 with the Optimal Rate

2024-11-05

Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, Yutaka Matsuo

arXiv PDF

Code

github.com/ishohei220/adopt
Official, in paper, PyTorch, ★ 435
github.com/huggingface/pytorch-image-models
PyTorch, ★ 36,538

Abstract

Adam is one of the most popular optimization algorithms in deep learning. However, it is known that Adam does not converge in theory unless the hyperparameter β_2 is chosen in a problem-dependent manner. There have been many attempts to fix the non-convergence (e.g., AMSGrad), but they require the impractical assumption that the gradient noise is uniformly bounded. In this paper, we propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of O(1/√T) with any choice of β_2 and without depending on the bounded noise assumption. ADOPT addresses the non-convergence issue of Adam by removing the current gradient from the second moment estimate and by changing the order of the momentum update and the normalization by the second moment estimate. We also conduct intensive numerical experiments and verify that ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning. The implementation is available at https://github.com/iShohei220/adopt.
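
The two changes described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation (the linked repository is the reference). The function name `adopt_sketch`, the default hyperparameters, the `eps` clipping, and the choice to initialise the second moment from the first gradient are assumptions made for illustration.

```python
import math


def adopt_sketch(grad_fn, theta, steps, lr=0.01, beta1=0.9, beta2=0.9999, eps=1e-6):
    """Illustrative ADOPT-style update on a list of floats.

    grad_fn(theta) must return a list of gradients, one per parameter.
    All names and defaults here are hypothetical, not the official API.
    """
    theta = list(theta)
    g = grad_fn(theta)
    v = [gi * gi for gi in g]  # assumed init: second moment from the first gradient
    m = [0.0] * len(theta)     # momentum starts at zero
    for _ in range(steps):
        g = grad_fn(theta)
        # Change 1: normalise with the PREVIOUS second moment estimate,
        # which does not yet contain the current gradient g.
        normed = [gi / max(math.sqrt(vi), eps) for gi, vi in zip(g, v)]
        # Change 2: momentum is accumulated AFTER normalisation
        # (Adam instead normalises the momentum itself).
        m = [beta1 * mi + (1.0 - beta1) * ni for mi, ni in zip(m, normed)]
        theta = [ti - lr * mi for ti, mi in zip(theta, m)]
        # Only now fold the current gradient into the second moment estimate.
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    return theta


# Usage: minimise f(x) = sum(x_i^2), whose gradient is 2 * x.
final = adopt_sketch(lambda th: [2.0 * t for t in th], [1.0, -2.0], steps=2000)
```

On this toy quadratic, both coordinates move toward zero. The sketch only shows the order of operations and makes no claim about the convergence rate stated in the abstract.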

Tasks

Deep Reinforcement Learning, Image Classification
