Compressing Large Language Models using Low Rank and Low Precision Decomposition
Rajarshi Saha, Naomi Sagan, Varun Srivastava, Andrea J. Goldsmith, Mert Pilanci
Abstract
The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces CALDERA -- a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix W by approximating it via a low-rank, low-precision decomposition as $W \approx Q + LR$. Here, L and R are low-rank factors, and the entries of Q, L, and R are quantized. The model is compressed by substituting each layer with its Q + LR decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, L and R are readily amenable to low-rank adaptation, consequently enhancing the zero-shot performance. CALDERA obtains this decomposition by formulating it as an optimization problem $\min_{Q, L, R} \|(Q + LR - W)X^\top\|_F^2$, where X is the calibration data, and Q, L, R are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of CALDERA are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results illustrate that LlaMa-2 7B/13B/70B and LlaMa-3 8B models compressed using CALDERA outperform existing post-training LLM compression techniques in the regime of less than 2.5 bits per parameter. The implementation is available at: https://github.com/pilancilab/caldera.
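To make the Q + LR decomposition concrete, below is a minimal, illustrative PyTorch sketch of an alternating fit, not the official CALDERA implementation (see the repository above for that). It simplifies the algorithm described in the abstract: it uses plain round-to-nearest uniform quantization in place of the paper's low-precision quantizers, and a data-agnostic truncated SVD in place of the calibration-aware rank-constrained regression (i.e., it effectively takes the calibration data X to be the identity). All function names and parameter choices below are hypothetical.

```python
# Hedged sketch of an alternating Q + LR fit; simplified stand-in, not CALDERA itself.
import torch

def uniform_quantize(M: torch.Tensor, bits: int) -> torch.Tensor:
    """Round-to-nearest uniform quantization onto a 'bits'-bit grid (per-tensor scale)."""
    levels = 2 ** bits - 1
    scale = (M.max() - M.min()) / levels
    return torch.round((M - M.min()) / scale) * scale + M.min()

def q_plus_lr_decomposition(W: torch.Tensor, rank: int, q_bits: int = 2,
                            lr_bits: int = 4, iters: int = 10):
    """Alternately fit a quantized Q and low-rank factors L, R so that W ~ Q + L @ R."""
    L = torch.zeros(W.shape[0], rank)
    R = torch.zeros(rank, W.shape[1])
    for _ in range(iters):
        # Fix L, R; quantize the residual to obtain the low-precision backbone Q.
        Q = uniform_quantize(W - L @ R, q_bits)
        # Fix Q; refit rank-'rank' factors to the new residual via truncated SVD,
        # then quantize the factors (the paper's L, R are also low-precision).
        U, S, Vh = torch.linalg.svd(W - Q, full_matrices=False)
        L = uniform_quantize(U[:, :rank] * S[:rank], lr_bits)
        R = uniform_quantize(Vh[:rank, :], lr_bits)
    return Q, L, R

W = torch.randn(512, 512)
Q, L, R = q_plus_lr_decomposition(W, rank=64)
print(torch.norm(Q + L @ R - W) / torch.norm(W))  # relative approximation error
```

In this sketch the substituted layer would use Q + L @ R in place of W; the paper instead minimizes the calibration-weighted error $\|(Q + LR - W)X^\top\|_F^2$, so the low-rank factors are fit to the directions that matter for the layer's actual activations rather than to the raw residual.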