FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation

2025-02-03

Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim

Code

github.com/dongwonjo/fastkv
Official, in paper, PyTorch, ★ 30

Abstract

While large language models (LLMs) excel at handling long-context sequences, they require substantial key-value (KV) caches to store contextual information, which can heavily burden computational efficiency and memory usage. Previous efforts to compress these KV caches primarily focused on reducing memory demands but were limited in enhancing latency. To address this issue, we introduce FastKV, a KV cache compression method designed to reduce latency for long-context inference. FastKV improves processing speed while preserving accuracy by adopting Token-Selective Propagation (TSP). This approach preserves full-context information in early layers of LLMs and selectively propagates only a portion of this information in later layers. This design enables FastKV to minimize redundant computation without sacrificing contextual fidelity. Our experimental results show that FastKV achieves up to 1.97× and 4.82× improvements in time-to-first-token (TTFT) and throughput, respectively, compared to baseline without KV cache compression. Moreover, FastKV successfully maintains accuracy within 1% of the baseline on long-context benchmarks. Our code is available at https://github.com/dongwonjo/FastKV.
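
The idea of Token-Selective Propagation described in the abstract can be illustrated with a toy sketch. This is not the code from the FastKV repository. The abstract does not say how tokens are chosen, so the attention-based saliency score used here is an assumption, and every name (`attention_scores`, `select_tokens`, `toy_layer`, `run`, `tsp_layer`, `keep`, `window`) is hypothetical. The sketch only shows the shape of the computation: early layers see every token, and layers after a chosen point see only a selected subset.

```python
import math


def attention_scores(hidden, window):
    """Assumed saliency: mean softmax attention from the last `window`
    tokens (used as queries) to every token (used as keys)."""
    dim = len(hidden[0])
    scores = [0.0] * len(hidden)
    for query in hidden[-window:]:
        logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in hidden]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        for i, e in enumerate(exps):
            scores[i] += e / total / window
    return scores


def select_tokens(scores, keep):
    """Indices of the `keep` highest-scoring tokens, in original order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(top[:keep])


def toy_layer(hidden):
    """Stand-in for a transformer layer; it only rescales the vectors."""
    return [[0.5 * x for x in row] for row in hidden]


def run(hidden, num_layers, tsp_layer, keep, window):
    """Run all layers, dropping unselected tokens after layer `tsp_layer`."""
    tokens_per_layer = []
    for layer in range(num_layers):
        tokens_per_layer.append(len(hidden))  # tokens this layer processes
        hidden = toy_layer(hidden)
        if layer == tsp_layer:
            kept = select_tokens(attention_scores(hidden, window), keep)
            hidden = [hidden[i] for i in kept]
    return hidden, tokens_per_layer


# 16 toy tokens with 4-dimensional hidden states
tokens = [[math.sin(i + j) for j in range(4)] for i in range(16)]
final_hidden, tokens_per_layer = run(tokens, num_layers=4, tsp_layer=1,
                                     keep=4, window=2)
print(tokens_per_layer)
```

In this toy run the first two layers process all 16 tokens and the last two process only the 4 that were selected, which is where the reduction in later-layer computation would come from under these assumptions.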

Tasks

Computational Efficiency
