Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding

2025-03-28

Aayush Gautam, Susav Shrestha, Narasimha Reddy


Abstract

Speculative decoding accelerates large language model (LLM) inference by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, selecting an optimal speculation length is critical for maximizing speedup while minimizing wasted computation. We introduce GammaTune and GammaTune+, training-free adaptive algorithms that dynamically adjust speculation length based on token acceptance rates using a heuristic-based switching mechanism. Evaluated on SpecBench across multiple tasks and model pairs, our method outperforms other heuristic-based approaches and fixed-length speculative decoding, achieving an average speedup of 15% (±5%) with GammaTune and 16% (±3%) with GammaTune+, while reducing performance variance. This makes GammaTune a robust and efficient solution for real-world deployment.
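The abstract describes adjusting the speculation length (often denoted gamma) from observed token acceptance rates. The paper's actual switching heuristic is not given here, so the sketch below is only an illustrative controller under assumed thresholds: grow gamma when the target model accepts most drafted tokens, shrink it when acceptance drops. The class name, thresholds, and update rule are hypothetical, not the authors' method.

```python
class AdaptiveGamma:
    """Illustrative acceptance-rate-driven speculation-length controller.

    Thresholds and the +/-1 update rule are assumptions for this sketch;
    GammaTune's real switching mechanism may differ.
    """

    def __init__(self, gamma=4, gamma_min=1, gamma_max=8, low=0.4, high=0.8):
        self.gamma = gamma          # current speculation length
        self.gamma_min = gamma_min  # lower bound on speculation length
        self.gamma_max = gamma_max  # upper bound on speculation length
        self.low = low              # acceptance rate below which gamma shrinks
        self.high = high            # acceptance rate above which gamma grows

    def update(self, accepted, proposed):
        """Adjust gamma after one draft/verify round.

        accepted: number of drafted tokens the target model accepted
        proposed: number of tokens the draft model proposed this round
        """
        rate = accepted / proposed if proposed else 0.0
        if rate >= self.high:
            self.gamma = min(self.gamma + 1, self.gamma_max)
        elif rate <= self.low:
            self.gamma = max(self.gamma - 1, self.gamma_min)
        return self.gamma


ctrl = AdaptiveGamma()
print(ctrl.update(accepted=4, proposed=4))  # high acceptance -> gamma grows to 5
print(ctrl.update(accepted=1, proposed=5))  # low acceptance -> gamma shrinks to 4
```

A fixed-length baseline corresponds to never calling `update`; the claimed benefit of an adaptive scheme is that gamma tracks how well the draft model matches the target on the current input, avoiding wasted draft computation when acceptance is low.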
