u-μP: The Unit-Scaled Maximal Update Parametrization

2024-07-24

Charlie Blake, Constantin Eichenberg, Josef Dean, Lukas Balles, Luke Y. Prince, Björn Deiseroth, Andres Felipe Cruz-Salinas, Carlo Luschi, Samuel Weinbach, Douglas Orr

Code

github.com/graphcore-research/unit-scaling
Official · In paper · PyTorch · ★ 133

Abstract

The Maximal Update Parametrization (μP) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-μP, which improves upon μP by combining it with Unit Scaling, a method for designing models that makes them easy to train in low-precision. The two techniques have a natural affinity: μP ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-μP models reaching a loss that is equal to or lower than comparable μP models and working out-of-the-box in FP8.
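
The "scale of one" idea in the abstract can be illustrated with a small, standard-library-only sketch. It is not the API of the unit-scaling library and not the paper's full scheme. It only assumes the common picture that a matrix multiply with unit-variance inputs and unit-variance weights, multiplied by 1/sqrt(fan_in), produces outputs whose standard deviation stays near one at any width. The function name `matmul_output_std` and its parameters are hypothetical, chosen for this illustration.

```python
import math
import random
import statistics


def matmul_output_std(width, scaled, n_samples=1000, seed=0):
    """Estimate the std of one output element of a width-`width` dot product.

    Inputs and weights are drawn i.i.d. from N(0, 1). If `scaled` is True,
    the result is multiplied by 1/sqrt(width) (an assumed unit-scaling-style
    factor); otherwise the raw sum is returned.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(width) if scaled else 1.0
    outputs = []
    for _ in range(n_samples):
        # Fresh unit-variance input vector and weight row for every sample.
        dot = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(width))
        outputs.append(scale * dot)
    return statistics.pstdev(outputs)


if __name__ == "__main__":
    for width in (16, 64, 256):
        raw = matmul_output_std(width, scaled=False)
        unit = matmul_output_std(width, scaled=True)
        print(f"width={width:4d}  unscaled std={raw:6.2f}  scaled std={unit:5.2f}")
```

Without the factor, the output scale grows roughly like sqrt(width), so it depends on model size. With it, the scale stays close to one at every width, which is the property that makes a narrow proxy model and a wide target model look alike at initialization in this toy setting.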
