
Delay-Tolerant Local SGD for Efficient Distributed Training

2021-01-01

An Xu, Xiao Yan, Hongchang Gao, Heng Huang


Abstract

The heavy communication required for model synchronization is a major bottleneck when scaling distributed deep neural network training to many workers. Moreover, model synchronization can suffer from long delays in scenarios such as federated learning and geo-distributed training. It is therefore crucial that distributed training methods be both delay-tolerant and communication-efficient, yet existing works cannot simultaneously address communication delay and the bandwidth constraint. To address this important and challenging problem, we propose a novel training framework, OLCO3, that achieves delay tolerance with a low communication budget by using stale information. OLCO3 introduces novel staleness compensation and compression compensation mechanisms to combat the effects of staleness and compression error. Theoretical analysis shows that OLCO3 achieves the same sub-linear convergence rate as vanilla synchronous stochastic gradient descent (SGD). Extensive experiments on deep learning tasks verify the effectiveness of OLCO3 and its advantages over existing works.
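The abstract does not spell out OLCO3's update rules, but it names the generic ingredients the framework builds on: local SGD with delayed (stale) synchronization, lossy compression of the communicated updates, and compensation terms for both staleness and compression error. The sketch below is a toy NumPy simulation of those ingredients only, not OLCO3 itself; the function and parameter names (`topk_compress`, `local_sgd_with_compensation`, `delay`, `k`) are hypothetical. Compression compensation is modeled as standard error feedback (the dropped coordinates are remembered and re-added next round), and staleness as an inbox of averaged deltas applied `delay` rounds late.

```python
import numpy as np

def topk_compress(v, k):
    """Keep the k largest-magnitude entries of v; zero the rest (lossy compression)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def local_sgd_with_compensation(grads_per_worker, lr=0.1, k=2, delay=1):
    """Toy simulation (hypothetical, not the OLCO3 algorithm): each worker takes
    a local SGD step per round and sends a top-k compressed model delta; the
    server applies the averaged deltas `delay` rounds late (staleness).
    Error feedback stores what compression dropped and re-adds it in the next
    round (compression compensation), so no update mass is lost permanently.

    grads_per_worker: array of shape (n_workers, n_rounds, dim) with the
    stochastic gradients each worker would see (supplied for this toy).
    """
    n_workers, n_rounds, dim = grads_per_worker.shape
    x_global = np.zeros(dim)
    errors = np.zeros((n_workers, dim))   # error-feedback memory per worker
    inbox = []                            # averaged deltas still "in flight"
    for t in range(n_rounds):
        deltas = []
        for w in range(n_workers):
            delta = -lr * grads_per_worker[w, t]          # one local SGD step
            msg = topk_compress(delta + errors[w], k)     # compress step + residual
            errors[w] = (delta + errors[w]) - msg         # remember dropped part
            deltas.append(msg)
        inbox.append(np.mean(deltas, axis=0))             # average workers' deltas
        if len(inbox) > delay:
            x_global += inbox.pop(0)                      # apply a stale average
    for d in inbox:                                       # flush remaining deltas
        x_global += d
    return x_global
```

With `k` equal to the full dimension, compression is lossless, the error memory stays zero, and (after flushing the in-flight deltas) the result matches plain averaged SGD, which is a useful sanity check on the error-feedback bookkeeping.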
