AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference
Zhuomin He, Yizhen Yao, Pengfei Zuo, Bin Gao, Qinya Li, Zhenzhe Zheng, Fan Wu
Code: github.com/asisys/adaskip (PyTorch)
Abstract
Long-context large language model (LLM) inference is increasingly important, motivating a number of studies devoted to alleviating its substantial storage and computational costs. Layer-wise skipping methods are promising optimizations but have rarely been explored in long-context inference. We observe that existing layer-wise skipping strategies have several limitations when applied to long-context inference: they cannot adapt to model and context variability, they disregard sublayer significance, and they are inapplicable to the prefilling phase. This paper proposes AdaSkip, an adaptive sublayer skipping method specifically designed for long-context inference. AdaSkip adaptively identifies less important layers by leveraging on-the-fly similarity information, enables sublayer-wise skipping, and accelerates both the prefilling and decoding phases. The effectiveness of AdaSkip is demonstrated through extensive experiments on various long-context benchmarks and models, showing superior inference performance over existing baselines.
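To make the core idea concrete, below is a minimal, self-contained sketch of similarity-based sublayer skipping, written against toy transformer blocks. It is an illustration of the mechanism the abstract describes, not the authors' implementation: the names `ToyBlock`, `importance`, `forward_adaptive`, and the `skip_threshold` parameter are all hypothetical, and the skip policy (flagging sublayers whose input and output hidden states are nearly identical) is one plausible reading of "on-the-fly similarity information."

```python
# Hypothetical sketch of similarity-based sublayer skipping; not the
# AdaSkip implementation. Sublayers whose output barely differs from
# their input contribute little and are flagged as skip candidates.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyBlock(nn.Module):
    """Stand-in for one transformer layer with two sublayers
    (attention and FFN), each applied with a residual connection."""

    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def sublayers(self):
        # Each sublayer maps hidden states to a residual update.
        yield "attn", lambda h: self.attn(h, h, h, need_weights=False)[0]
        yield "ffn", lambda h: self.ffn(h)


def importance(h_in: torch.Tensor, h_out: torch.Tensor) -> float:
    """1 - cosine similarity between a sublayer's input and output,
    averaged over tokens. Near zero means the sublayer barely changed
    the representation, so it is a candidate for skipping."""
    sim = F.cosine_similarity(h_in, h_out, dim=-1)  # per-token similarity
    return float((1.0 - sim).mean())


@torch.no_grad()
def forward_adaptive(hidden: torch.Tensor, blocks, skip_threshold: float = 0.05):
    """Run all sublayers while recording on-the-fly importance scores;
    sublayers scoring below `skip_threshold` are returned as candidates
    to skip for subsequent tokens or requests."""
    skip_candidates = []
    for i, block in enumerate(blocks):
        for name, f in block.sublayers():
            new_hidden = hidden + f(hidden)  # residual connection
            score = importance(hidden, new_hidden)
            if score < skip_threshold:
                skip_candidates.append((i, name, score))
            hidden = new_hidden
    return hidden, skip_candidates


blocks = [ToyBlock(64) for _ in range(4)]
h = torch.randn(1, 16, 64)  # (batch, tokens, d_model)
out, to_skip = forward_adaptive(h, blocks)
print("skip candidates (layer, sublayer, importance):", to_skip)
```

Scoring at sublayer granularity rather than whole layers is what lets the attention and FFN halves of a layer be skipped independently; the same scoring can run during prefilling, since it needs only the hidden states already produced by the forward pass.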