Spectral Edge Dynamics of Training Trajectories: Signal--Noise Geometry Across Scales
Yongzhong Xu
Abstract
Despite hundreds of millions of parameters, transformer training trajectories evolve within only a few coherent directions. We introduce Spectral Edge Dynamics (SED) to quantify this structure: a rolling-window SVD of parameter updates reveals a sharp boundary -- the spectral edge -- between coherent optimization directions and stochastic noise, identified via the maximum consecutive singular value ratio σ_k / σ_{k+1}. Across a 51M-parameter TinyStories model (4 seeds) and GPT-2 124M under distribution shift, the spectral edge exhibits a universal three-phase pattern (rise, plateau, collapse). The effective signal rank adapts to task complexity (k* = 2 at 51M, k* = 3 at 124M), and the directional coupling between spectral geometry and validation loss reverses with window size -- a lag flip reflecting the timescale of trajectory integration. Johnson--Lindenstrauss projection to d = 10W dimensions (e.g., d = 100 for W = 10) preserves the spectral gap within 5.7%, making the framework applicable to models of arbitrary scale. In companion work, the same spectral geometry provides early-warning signals of grokking -- predicting generalization 600--1,700 steps before it occurs across modular arithmetic, Dyck languages, and the SCAN benchmark.
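The edge detection described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes the window of W flattened parameter updates is available as a (W, P) matrix, uses a Gaussian random matrix for the Johnson--Lindenstrauss projection to d = 10W dimensions, and locates the spectral edge as the maximizer of the consecutive singular value ratio σ_k / σ_{k+1}. The function name `spectral_edge` and the toy data are hypothetical.

```python
import numpy as np

def spectral_edge(updates, d_proj=None, seed=0):
    """Locate the spectral edge of a window of parameter updates.

    updates: array of shape (W, P) -- W flattened update vectors.
    d_proj:  optional Johnson-Lindenstrauss target dimension; if given,
             updates are projected to d_proj dims before the SVD.
    Returns (k_star, gap): the edge index and the maximum consecutive
    singular value ratio sigma_k / sigma_{k+1}.
    """
    X = np.asarray(updates, dtype=np.float64)
    if d_proj is not None:
        rng = np.random.default_rng(seed)
        # Gaussian JL projection, scaled to preserve norms in expectation.
        R = rng.standard_normal((X.shape[1], d_proj)) / np.sqrt(d_proj)
        X = X @ R
    s = np.linalg.svd(X, compute_uv=False)
    ratios = s[:-1] / s[1:]
    k_star = int(np.argmax(ratios)) + 1  # edge sits after the k-th value
    return k_star, float(ratios[k_star - 1])

# Toy window: W = 10 updates built from 2 coherent directions plus noise,
# so the detected rank should be k* = 2.
rng = np.random.default_rng(1)
basis = rng.standard_normal((2, 5000))      # two shared signal directions
coeffs = rng.standard_normal((10, 2))
U = 10.0 * coeffs @ basis + 0.1 * rng.standard_normal((10, 5000))
k, gap = spectral_edge(U, d_proj=100)       # d = 10W = 100, as in the text
print(k, gap)
```

Because the JL projection approximately preserves pairwise geometry, the singular value gap survives the reduction from P = 5000 to d = 100 dimensions, which is what makes the method scale-independent.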