QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models

2023-07-07

Tommaso Pegolotti, Elias Frantar, Dan Alistarh, Markus Püschel

Code

github.com/ist-daslab/qigen
Official · In paper · PyTorch · ★ 28

Abstract

We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution. A preliminary implementation is available at https://github.com/IST-DASLab/QIGen.

Tasks

Code Generation, CPU
