
Omni-Masked Gradient Descent: Memory-Efficient Optimization via Mask Traversal with Improved Convergence

2026-03-10

Hui Yang, Tao Ren, Jinyang Jiang, Wan Tian, Yijie Peng


Abstract

Memory-efficient optimization methods have recently attracted increasing attention for scaling full-parameter training of large language models under the GPU-memory bottleneck. Existing approaches either lack clear convergence guarantees or achieve only the standard O(ε^-4) iteration complexity in nonconvex settings. We propose Omni-Masked Gradient Descent (OMGD), an optimization method based on mask traversal for memory-efficient training, and provide a nonconvex convergence analysis that establishes a strictly improved iteration complexity of O(ε^-3) for finding an ε-approximate stationary point. Empirically, OMGD is a lightweight, plug-and-play approach that integrates seamlessly into most mainstream optimizers, yielding consistent improvements over competitive baselines on both fine-tuning and pre-training tasks.
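The abstract's core idea, updating only a masked subset of parameters per step while a traversal schedule cycles the mask over all coordinates, can be illustrated with a minimal NumPy sketch. This is a hypothetical toy on a quadratic objective, not the paper's algorithm: the function names (`omgd_step`, `mask_traversal`), the block-partition masking scheme, and the step size are all illustrative assumptions.

```python
import numpy as np

def omgd_step(params, grad_fn, mask, lr=0.1):
    """One masked gradient-descent step: only coordinates selected by
    the boolean `mask` are updated, so the optimizer state / gradient
    buffer for unmasked coordinates need not be kept in memory.
    (Illustrative sketch; the paper's exact update rule may differ.)"""
    g = grad_fn(params)
    out = params.copy()
    out[mask] -= lr * g[mask]  # update only the masked block
    return out

def mask_traversal(params, grad_fn, n_blocks=4, epochs=50, lr=0.1):
    """Cycle through disjoint coordinate blocks so every parameter is
    eventually updated -- a simple stand-in for 'mask traversal'."""
    d = params.size
    blocks = np.array_split(np.arange(d), n_blocks)
    for _ in range(epochs):
        for block in blocks:
            mask = np.zeros(d, dtype=bool)
            mask[block] = True
            params = omgd_step(params, grad_fn, mask, lr)
    return params

# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself;
# masked descent with full traversal still drives x toward 0.
x = mask_traversal(np.ones(8), lambda p: p)
```

At any single step only one block's gradient entries are used, which is the memory saving; the traversal schedule is what lets the analysis still reach a stationary point of the full objective.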
