Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs
Nir Ailon, Akhiad Bercovich, Omri Weinstein
Abstract
We propose a cheaper alternative bilinear operator to matrix multiplication in deep neural networks (DNNs). Unlike many stubborn attempts to accelerate MatMuls in DNN inference, this operator is supported by the capabilities of existing GPU hardware, most notably NVIDIA TensorCores. To our knowledge, this is the first GPU-native acceleration technique which does not decrease (in fact, increases) the number of trainable parameters of the network, mitigating the accuracy loss of compression-based techniques. Hence, this operator is at once more expressive than MatMul, yet requires substantially fewer FLOPs to evaluate. We term this new operator Strassen-Tile (STL). The main idea behind STL(X, W) is a local change of basis (learnable encoder) on weight and activation tiles, after which we perform batched elementwise products between tiles, followed by a final decoding transformation (inspired by algebraic pipelines from fast matrix and polynomial multiplication). We compare STL against two benchmarks. The first is the SoTA T2T-ViT on ImageNet-1K. Here we show that replacing all linear layers with STL and training from scratch results in a ×2.7 reduction in FLOPs with a 0.5 accuracy improvement. Our second speed-accuracy comparison benchmark, for pretrained LLMs, is the most practical GPU-acceleration technique: structured sparsity. Finetuning TinyLlama [tinyllama24] with STL layers on the SlimPajama dataset achieves accuracy comparable to 2:4 structured sparsity, with a ×2.2 FLOP speedup compared to ×1.7 for the latter. Finally, we discuss a group-theoretic approach for discovering universal encoders for STL, which could lead to fast black-box acceleration via approximate matrix multiplication (AMM).
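The abstract's description of STL(X, W) — encode each tile under a learnable change of basis, take batched elementwise products between encoded tiles, then decode — can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (tile size `t`, encoded dimension `r`, dense encoder/decoder matrices `E_x`, `E_w`, `D`), not the paper's implementation; the actual operator targets TensorCore-friendly layouts and learned encoders.

```python
import numpy as np

def strassen_tile(X, W, E_x, E_w, D, t):
    """Hypothetical sketch of a Strassen-Tile-style bilinear operator.

    X: (n, k) activations; W: (k, m) weights; n, k, m divisible by t.
    E_x, E_w: (r, t*t) tile encoders (learnable in the paper's setting).
    D: (t*t, r) decoder. Exact MatMul needs a rank-t^3 bilinear algorithm
    per tile pair; STL saves FLOPs by using r < t^3 multiplications.
    """
    n, k = X.shape
    k2, m = W.shape
    assert k == k2 and n % t == 0 and k % t == 0 and m % t == 0
    nb, kb, mb = n // t, k // t, m // t
    # Flatten each t x t tile into a length-(t*t) vector.
    Xt = X.reshape(nb, t, kb, t).transpose(0, 2, 1, 3).reshape(nb, kb, t * t)
    Wt = W.reshape(kb, t, mb, t).transpose(0, 2, 1, 3).reshape(kb, mb, t * t)
    # Local change of basis: encode every tile into r dimensions.
    Xe = Xt @ E_x.T                          # (nb, kb, r)
    We = Wt @ E_w.T                          # (kb, mb, r)
    # Batched elementwise products in the encoded basis; linearity of the
    # decoder lets us sum over the shared tile index before decoding.
    Ze = np.einsum('ilr,ljr->ijr', Xe, We)   # (nb, mb, r)
    # Decode each output tile and reassemble the (n, m) result.
    Ct = Ze @ D.T                            # (nb, mb, t*t)
    return Ct.reshape(nb, mb, t, t).transpose(0, 2, 1, 3).reshape(n, m)
```

As a sanity check, with tile size t = 1 and identity encoders/decoder the operator reduces to exact matrix multiplication; for t > 1 and r < t^3, the learned encoders trade exactness for fewer multiplications.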