Data movement limits to frontier model training

2024-11-02

Ege Erdil, David Schneider-Joseph

Abstract

We present a theoretical model of distributed training, and use it to analyze how far dense and sparse training runs can be scaled. Under our baseline assumptions, given a three-month training duration, data movement bottlenecks begin to significantly lower hardware utilization for training runs exceeding about 10^28 FLOP, two orders of magnitude above the largest training run to date, suggesting the arrival of fundamental barriers to scaling in three years given recent rates of growth. A training run exceeding about 10^31 FLOP is infeasible even at low utilization. However, more aggressive batch size scaling and/or shorter and fatter model shapes, if achievable, have the potential to permit much larger training runs.
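The headline figures in the abstract can be unpacked with some back-of-the-envelope arithmetic. The sketch below is not the paper's model: it only converts the quoted thresholds into a sustained throughput and an implied growth rate. The 90-day reading of "three month" and the helper names are assumptions made for illustration.

```python
SECONDS_PER_DAY = 86_400
TRAINING_DAYS = 90  # assumption: "three month" is read as 90 days


def sustained_flop_per_second(total_flop: float, days: int = TRAINING_DAYS) -> float:
    """Average achieved FLOP/s needed to finish total_flop within the given days."""
    return total_flop / (days * SECONDS_PER_DAY)


def implied_annual_growth(orders_of_magnitude: float, years: float) -> float:
    """Yearly multiplier that closes a gap of the given orders of magnitude in `years`."""
    return 10 ** (orders_of_magnitude / years)


# Thresholds quoted in the abstract.
UTILIZATION_THRESHOLD_FLOP = 1e28  # utilization starts to drop significantly
INFEASIBLE_THRESHOLD_FLOP = 1e31   # infeasible even at low utilization

if __name__ == "__main__":
    for label, flop in [
        ("utilization threshold", UTILIZATION_THRESHOLD_FLOP),
        ("infeasibility threshold", INFEASIBLE_THRESHOLD_FLOP),
    ]:
        rate = sustained_flop_per_second(flop)
        print(f"{label}: {flop:.0e} FLOP needs about {rate:.2e} FLOP/s sustained")

    # "Two orders of magnitude ... in three years" implies this yearly factor.
    growth = implied_annual_growth(orders_of_magnitude=2, years=3)
    print(f"implied compute growth: about {growth:.1f}x per year")
```

Under these assumptions, a 10^28 FLOP run needs roughly 1.3 × 10^21 FLOP/s of achieved (not peak) throughput, and closing a two-order-of-magnitude gap in three years corresponds to compute growing by about 4.6x per year.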

Tasks

model
