MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention
Can Yaras, Alec S. Xu, Pierre Abillama, Changwoo Lee, Laura Balzano
Abstract
Transformers have achieved state-of-the-art performance across various tasks, but suffer from quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention -- a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with $\mathcal{O}(N\sqrt{N}d)$ computational complexity and $\mathcal{O}(Nd)$ memory/IO complexity. Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs. With optimized kernels, MonarchAttention achieves substantial wall-time speed-ups over FlashAttention-2: 1.4× for shorter sequences (N=256), 4.5× for medium-length sequences (N=4K), and 8.2× for longer sequences (N=16K). We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts. Our code is available at https://github.com/cjyaras/monarch-attention.
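For intuition on where the sub-quadratic cost comes from, below is a minimal PyTorch sketch of the structured matrix-vector product underlying Monarch matrices, using one common parameterization M = PᵀLPR with two block-diagonal factors and a stride permutation. This is a hypothetical illustration of the $\mathcal{O}(N\sqrt{N})$ multiply cost only; it is not the paper's projection algorithm or the official repository's API, and the function name `monarch_matvec` is our own.

```python
# Hedged sketch (not from the official repo): applying a Monarch matrix
# M = P^T L P R to a vector, where L and R are block-diagonal with
# m = sqrt(N) blocks of size m x m, and P is the reshape-transpose
# ("stride") permutation. Cost is O(N * sqrt(N)) rather than O(N^2).
import torch

def monarch_matvec(L_blocks, R_blocks, x):
    # L_blocks, R_blocks: (m, m, m) tensors holding the diagonal blocks.
    # x: vector of length N = m * m.
    m = R_blocks.shape[0]
    # Block-diagonal R acts on contiguous chunks of length m.
    x = x.reshape(m, m)                          # (block, within-block)
    x = torch.einsum('bij,bj->bi', R_blocks, x)  # y_b = R_b @ x_b
    # Permutation P: transpose the (block, within-block) grid.
    x = x.t().contiguous()
    # Block-diagonal L acts in the permuted coordinates.
    x = torch.einsum('bij,bj->bi', L_blocks, x)
    # Permutation P^T: undo the transpose and flatten back to length N.
    return x.t().contiguous().reshape(-1)

# Example usage with N = 16 (m = 4).
m = 4
L, R, x = torch.randn(m, m, m), torch.randn(m, m, m), torch.randn(m * m)
y = monarch_matvec(L, R, x)  # shape (16,)
```

Because each block-diagonal factor costs O(N√N) and the permutations are free reshapes, a Monarch-structured attention matrix can be applied to the value matrix without ever materializing the dense N×N attention map.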