Can LLMs Maintain Fundamental Abilities under KV Cache Compression?
Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu
Abstract
This paper investigates an underexplored challenge in large language models (LLMs): the impact of KV cache compression methods on LLMs' fundamental capabilities. Although existing methods achieve impressive compression ratios on long-context benchmarks, their effects on core model capabilities remain understudied. We present KVFundaBench, a comprehensive benchmark for systematically evaluating the effects of KV cache compression across diverse fundamental LLM capabilities, spanning world knowledge, commonsense reasoning, arithmetic reasoning, code generation, safety, and long-context understanding and generation. Our analysis reveals several key findings: (1) Task-Dependent Degradation; (2) Model-Type Robustness; (3) Prompt-Length Vulnerability; (4) Chunk-Level Superiority; (5) Prompt-Gain Sensitivity; (6) Long-Context Generation Sensitivity. Based on our analysis of attention patterns and cross-task compression performance, we propose ShotKV, a novel compression approach that handles the prefill and decoding phases distinctly while maintaining shot-level semantic coherence. Empirical results show that ShotKV achieves 9%-18% performance improvements on long-context generation tasks under aggressive compression ratios.
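To make the idea of shot-level coherence concrete, the following is a minimal, speculative sketch of how prefill-phase eviction might keep or drop whole few-shot examples ("shots") rather than individual tokens. The abstract does not specify ShotKV's algorithm; the function name `shotkv_prefill_compress`, the mean-attention scoring, and the greedy budget rule below are all assumptions for illustration, not the paper's method.

```python
import torch

def shotkv_prefill_compress(attn_scores, shot_spans, keep_ratio):
    """Hypothetical shot-level KV eviction sketch (details assumed).

    attn_scores: (seq_len,) aggregated attention each prompt token received
    shot_spans:  list of (start, end) token index pairs, one per shot
    keep_ratio:  fraction of prompt tokens whose KV entries are retained

    Returns indices of kept tokens. Whole shots are kept or evicted
    together, so no few-shot example is left semantically truncated.
    """
    budget = int(attn_scores.numel() * keep_ratio)
    # Score each shot by its mean attention mass (an assumed criterion).
    shot_scores = [(attn_scores[s:e].mean().item(), s, e) for s, e in shot_spans]
    kept = []
    # Greedily keep the highest-scoring shots until the token budget is spent.
    for score, s, e in sorted(shot_scores, reverse=True):
        if len(kept) + (e - s) > budget:
            continue
        kept.extend(range(s, e))
    return torch.tensor(sorted(kept), dtype=torch.long)

# Toy usage: a 12-token prompt split into three 4-token shots,
# compressed to half its original KV cache size.
scores = torch.rand(12)
spans = [(0, 4), (4, 8), (8, 12)]
print(shotkv_prefill_compress(scores, spans, keep_ratio=0.5))
```

Under this reading, decoding-phase entries would be managed by a separate policy, consistent with the abstract's claim that the two phases are handled distinctly.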