Can We Learn Communication-Efficient Optimizers?

2023-12-02Code Available0· sign in to hype

Charles-Étienne Joseph, Benjamin Thérien, Abhinav Moudgil, Boris Knyazev, Eugene Belilovsky

Code Available — Be the first to reproduce this paper.

Code

github.com/lefameuxbeding/learned_aggregation
OfficialIn papernone★ 5

Abstract

Communication-efficient variants of SGD, specifically local SGD, have received a great deal of interest in recent years. These approaches compute multiple gradient steps locally, that is on each worker, before averaging model parameters, helping relieve the critical communication bottleneck in distributed deep learning training. Although many variants of these approaches have been proposed, they can sometimes lag behind state-of-the-art adaptive optimizers for deep learning. In this work, we investigate if the recent progress in the emerging area of learned optimizers can potentially close this gap while remaining communication-efficient. Specifically, we meta-learn how to perform global updates given an update from local SGD iterations. Our results demonstrate that learned optimizers can substantially outperform local SGD and its sophisticated variants while maintaining their communication efficiency. Learned optimizers can even generalize to unseen and much larger datasets and architectures, including ImageNet and ViTs, and to unseen modalities such as language modeling. We therefore demonstrate the potential of learned optimizers for improving communication-efficient distributed learning.

Tasks

Language Modeling Language Modelling

Can We Learn Communication-Efficient Optimizers?

Code

Abstract

Tasks

Reproductions