FLARE: Fast Low-rank Attention Routing Engine
Vedant Puri, Aditya Joglekar, Sri Datta Ganesh Bandreddi, Kevin Ferguson, Yu-hsuan Chen, Yongjie Jessica Zhang, Levent Burak Kara
Abstract
The quadratic complexity of self-attention limits the scalability of transformers on long sequences. We introduce Fast Low-rank Attention Routing Engine (FLARE), a token-mixing operator that realizes low-rank attention by routing information through a small set of latent tokens. Each layer induces an input-input token mixing matrix of rank at most M via a minimal encode-decode factorization implemented using only two standard scaled dot-product attention (SDPA) calls. Because the dominant O(NM) computation is expressed purely in terms of standard SDPA, FLARE is compatible with fused attention kernels and avoids materializing M N projection matrices. FLARE further assigns disjoint latent slices to each attention head, yielding a mixture of head-specific low-rank pathways. Empirically, FLARE scales to one-million-point unstructured meshes on a single GPU, achieves state-of-the-art accuracy on PDE surrogate benchmarks, and outperforms general-purpose efficient-attention methods on the Long Range Arena suite. We additionally release a large-scale additive manufacturing benchmark dataset. Our code is available at https://github.com/vpuri3/FLARE.py.