SOTAVerified

Deep Delta Learning

2026-01-29

Yifan Zhang, Yifeng Liu, Mengdi Wang, Quanquan Gu

Abstract

The effectiveness of deep residual networks hinges on the identity shortcut connection. While this mechanism alleviates the vanishing-gradient problem, it also imposes a strictly additive inductive bias on feature transformations, limiting the network's ability to model complex hidden-state transitions. In this paper, we introduce Deep Delta Learning (DDL), which generalizes the shortcut from a fixed identity map to a learnable, state-dependent linear operator. The resulting Delta Operator is a rank-1 perturbation of the identity, A(X) = I − β(X) k(X) k(X)^⊤, parameterized by a unit direction k(X) and a scalar gate β(X). We provide a spectral analysis showing that β(X) continuously interpolates the shortcut between identity (β=0), orthogonal projection (β=1), and Householder reflection (β=2). Furthermore, we rewrite the residual update as a synchronized rank-1 delta write: β scales both the removal of the current k-component and the injection of the new k-component. This unification enables explicit control of the shortcut spectrum along a data-dependent direction while retaining stable training behavior. Empirically, replacing Transformer residual additions with DDL improves validation loss and perplexity, as well as downstream evaluation accuracy on language modeling tasks, with larger gains in the expanded-state setting.
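The spectral claim in the abstract can be checked numerically. The sketch below (an illustration under our own assumptions, not the authors' implementation; `delta_operator` is a hypothetical helper) builds A(X) = I − β k k^⊤ for a unit direction k and verifies that β = 0 gives the identity, β = 1 an orthogonal projection, and β = 2 a Householder reflection:

```python
import numpy as np

def delta_operator(k: np.ndarray, beta: float) -> np.ndarray:
    """Rank-1 perturbation of the identity along unit direction k:
    A = I - beta * k k^T. Symmetric, so its spectrum is real."""
    k = k / np.linalg.norm(k)  # enforce ||k|| = 1
    return np.eye(k.shape[0]) - beta * np.outer(k, k)

rng = np.random.default_rng(0)
k = rng.standard_normal(4)

A0 = delta_operator(k, 0.0)  # beta = 0: identity shortcut
A1 = delta_operator(k, 1.0)  # beta = 1: projection onto k's orthogonal complement
A2 = delta_operator(k, 2.0)  # beta = 2: Householder reflection about that hyperplane

assert np.allclose(A0, np.eye(4))    # identity
assert np.allclose(A1 @ A1, A1)      # idempotent => orthogonal projection
assert np.allclose(A2 @ A2, np.eye(4))  # involutive => reflection

# Along k the eigenvalue is 1 - beta; the remaining eigenvalues stay at 1,
# so beta continuously tunes the shortcut spectrum in one data-dependent direction.
for beta in (0.0, 0.5, 1.0, 2.0):
    eigs = np.linalg.eigvalsh(delta_operator(k, beta))
    assert np.isclose(eigs.min(), min(1.0, 1.0 - beta))
```

In this picture, the "synchronized rank-1 delta write" reads as A(X) h removing β times the current k-component of the hidden state, while the residual branch injects β times a new component along the same direction.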
