
The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training

2025-02-26

Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, Weinan E, Lei Wu


Abstract

Transformers consist of diverse building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feedforward networks, so understanding the differences and interactions among these blocks is important. In this paper, we uncover a clear Sharpness Disparity across these blocks, which emerges early in training and, intriguingly, persists throughout the training process. Motivated by this finding, we propose Blockwise Learning Rate (LR), a strategy that tailors the LR to each block's sharpness, accelerating large language model (LLM) pre-training. By integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and a nearly 2× speedup compared to vanilla AdamW. We demonstrate this acceleration across GPT-2 and LLaMA, with model sizes ranging from 0.12B to 2B parameters, on the OpenWebText, MiniPile, and C4 datasets. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined 2× speedup and 2× memory saving. These results underscore the potential of exploiting the sharpness disparity to improve LLM training.
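The abstract describes Blockwise LR only at a high level. As a minimal sketch of how such a strategy could be wired up, the PyTorch snippet below assigns a per-block learning rate to AdamW via parameter groups, keyed on the block types the paper names (embedding, normalization, attention, feedforward). The name-matching heuristics and the multiplier values are illustrative assumptions, not the paper's actual ratios or implementation.

```python
import torch
from torch import nn

def blockwise_param_groups(model: nn.Module, base_lr: float):
    """Partition parameters by block type and attach a per-block LR.

    The multipliers below are placeholders (assumptions for
    illustration, NOT the paper's values); the idea is that blocks
    with lower sharpness can tolerate a larger learning rate.
    """
    multipliers = {
        "embedding": 1.0,
        "norm": 4.0,
        "attention": 2.0,
        "ffn": 2.0,
        "other": 1.0,
    }
    buckets = {block: [] for block in multipliers}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        lname = name.lower()
        if "embed" in lname:
            buckets["embedding"].append(param)
        elif "norm" in lname or "ln" in lname:
            buckets["norm"].append(param)
        elif "attn" in lname or "attention" in lname:
            buckets["attention"].append(param)
        elif "mlp" in lname or "ffn" in lname:
            buckets["ffn"].append(param)
        else:
            buckets["other"].append(param)
    return [
        {"params": params, "lr": base_lr * multipliers[block]}
        for block, params in buckets.items()
        if params
    ]

# Usage: feed the groups into vanilla AdamW.
# model = MyTransformer(...)  # hypothetical model
# optimizer = torch.optim.AdamW(
#     blockwise_param_groups(model, base_lr=3e-4),
#     betas=(0.9, 0.95), weight_decay=0.1,
# )
```

Because AdamW already accepts per-group learning rates, this kind of blockwise scheme adds no optimizer-state overhead, which is consistent with the paper also combining it with the memory-efficient Adam-mini.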
