Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving
Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen
- Code (official, PyTorch): github.com/llmkvsys/rethink-kv-compression
Abstract
Key-Value cache (KV cache) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily reduces the memory consumption of the KV cache and thereby the computation cost. Despite the development of many compression algorithms, their adoption in production environments remains limited. In this paper, we revisit mainstream KV cache compression solutions from a practical perspective. Our contributions are three-fold. First, we comprehensively review existing algorithmic designs and benchmark studies for KV cache compression and identify missing pieces in their performance measurement that could hinder adoption in practice. Second, we empirically evaluate representative KV cache compression methods and uncover two key issues that affect computational efficiency: (1) although compressing the KV cache reduces memory consumption, current implementations (e.g., FlashAttention, PagedAttention) are not optimized for production-level LLM serving, resulting in suboptimal throughput; (2) compressing the KV cache may produce longer outputs, increasing end-to-end latency. We further investigate the accuracy of individual samples rather than overall performance, revealing intrinsic limitations of KV cache compression on specific LLM tasks. Third, we provide tools to inform future KV cache compression studies and facilitate their practical deployment in production. They are open-sourced at https://github.com/LLMkvsys/rethink-kv-compression.
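To make the core idea concrete, below is a minimal, hypothetical sketch of one common family of KV cache compression: token eviction under a fixed memory budget. It keeps a few initial "sink" tokens plus the most recent tokens and discards the rest. This is an illustrative toy policy (similar in spirit to sliding-window eviction schemes), not an algorithm proposed or evaluated in this paper; the function name and parameters are invented for illustration.

```python
def compress_kv_cache(kv_cache, budget, sink=4):
    """Toy KV-cache eviction sketch (illustrative only).

    kv_cache: list of (key, value) entries, one per cached token.
    budget:   maximum number of entries to retain.
    sink:     number of initial tokens always kept.
    Keeps the first `sink` tokens plus the most recent tokens,
    up to `budget` entries total.
    """
    if len(kv_cache) <= budget:
        return kv_cache  # nothing to evict
    recent = budget - sink
    # Evict everything between the sink tokens and the recent window.
    return kv_cache[:sink] + kv_cache[-recent:]

# Example: a 10-token cache compressed to a budget of 6 entries.
cache = [(f"k{i}", f"v{i}") for i in range(10)]
compressed = compress_kv_cache(cache, budget=6)
print([k for k, _ in compressed])  # ['k0', 'k1', 'k2', 'k3', 'k8', 'k9']
```

The memory saving is immediate (6 entries instead of 10), but as the abstract notes, discarding cache entries can change model outputs, which is one source of the accuracy and latency trade-offs the paper measures.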