Implicit Bias and Fast Convergence Rates for Self-attention

2024-02-08Unverified0· sign in to hype

Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis

Unverified — Be the first to reproduce this paper.

Abstract

We study the fundamental optimization principles of self-attention, the defining mechanism of transformers, by analyzing the implicit bias of gradient-based optimizers in training a self-attention layer with a linear decoder in binary classification. Building on prior studies in linear logistic regression, recent findings demonstrate that the key-query matrix W_t from gradient-descent (GD) converges in direction towards W_mm, which maximizes the margin between optimal and non-optimal tokens across sequences. However, this convergence is local, dependent on initial conditions, only holds asymptotically as the number of iterations increases, and leaves questions about the potential benefits of adaptive step-size rules unaddressed. To bridge this gap, we first establish scenarios for which convergence is provably global. We then analyze two adaptive step-size strategies: normalized GD and Polyak step-size, demonstrating finite-time convergence rates for W_t to W_mm, and quantifying the sparsification rate of the attention map. These findings not only show that these strategies can accelerate parameter convergence over standard GD in a non-convex setting but also deepen the understanding of the implicit bias in self-attention, linking it more closely to the phenomena observed in linear logistic regression despite its intricate non-convex nature.

Tasks

Binary Classification Decoder regression

Implicit Bias and Fast Convergence Rates for Self-attention

Abstract

Tasks

Reproductions