On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer
Ruihan Xu, Jiajin Li, Yiping Lu
Abstract
A central question in modern deep learning is how to design optimizers whose behavior remains stable as the network width w increases. We address this question by interpreting several widely used neural-network optimizers, including AdamW and Muon, as instances of steepest descent under matrix operator norms. This perspective links optimizer geometry with the Lipschitz structure of the network forward map and enables width-independent control of both Lipschitz and smoothness constants. However, steepest-descent rules induced by standard ℓp → ℓq operator norms lack layerwise composability and therefore cannot provide width-independent bounds in deep architectures. We overcome this limitation by introducing a family of mean-normalized operator norms that admit layerwise composability, yield width-independent smoothness bounds, and give rise to practical optimizers such as rescaled AdamW, row normalization, and column normalization. The resulting width-aware learning-rate scaling rules recover μP scaling (Yang and Hu, 2021) as a special case and provide a principled mechanism for cross-width learning-rate transfer across a broad class of optimizers. We further show that Muon can suffer an O(w) worst-case growth in the smoothness constant, whereas a new family of row-normalized optimizers we propose achieves width-independent smoothness guarantees. Based on these observations, we propose MOGA (Matrix Operator Geometry Aware), a width-aware optimizer based only on row/column-wise normalization that enables stable learning-rate transfer across model widths. Large-scale pre-training on GPT-2 and LLaMA shows that MOGA, especially with row normalization, is competitive with Muon while being notably faster in large-token and low-loss regimes.
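To make the row-normalization idea concrete, the following is a minimal sketch of a row-normalized gradient step: each row of the gradient matrix is rescaled to unit Euclidean norm before the update, so the per-row step size does not grow with the fan-in dimension. This is an illustrative assumption-based sketch, not the paper's exact MOGA update rule; the function name and hyperparameters are hypothetical.

```python
import numpy as np

def row_normalized_update(W, G, lr=0.1, eps=1e-8):
    """One steepest-descent-style step with row-wise normalization
    (hypothetical sketch of the row-normalization idea, not the
    paper's exact MOGA rule).

    W : weight matrix of shape (fan_out, fan_in)
    G : gradient of the loss w.r.t. W, same shape as W
    """
    # Euclidean norm of each gradient row, shape (fan_out, 1)
    row_norms = np.linalg.norm(G, axis=1, keepdims=True)
    # Every row of the descent direction has norm ~1,
    # independent of fan_in (the width dimension)
    direction = G / (row_norms + eps)
    return W - lr * direction
```

Because each row of the step has unit norm regardless of fan_in, the magnitude of the update is width-independent by construction, which is the intuition behind the cross-width learning-rate transfer described in the abstract.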