
Speculative Ensemble: Fast Large Language Model Ensemble via Speculation

2025-02-01

Jiale Fu, Yuchu Jiang, Junkai Chen, Jiaming Fan, Xin Geng, Xu Yang


Abstract

Ensemble methods enhance Large Language Models (LLMs) by combining multiple models but suffer from high computational costs. In this paper, we introduce Speculative Ensemble (SE), a novel framework that accelerates LLM ensembles without sacrificing performance, inspired by Speculative Decoding, where a small proposal model generates tokens sequentially and a larger target model verifies them in parallel. Our approach builds on two key insights: (1) the verification distribution can be the ensemble distribution of the proposal and target models, and (2) alternating each model as proposer and verifier further enhances efficiency. We generalize this method to ensembles of n models and theoretically prove that SE is never slower than a standard ensemble, and is typically faster. Extensive experiments demonstrate speedups of 1.11x-2.23x over standard ensemble techniques without compromising generation quality. Our code is available at https://github.com/Kamichanw/Speculative-Ensemble/
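The speculate-then-verify loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation (see the linked repository for that): it assumes Hugging Face-style causal LMs whose forward pass returns `.logits`, a batch size of 1, a simple averaging ensemble rule, and a draft length `gamma`, all of which are assumptions for exposition. The key point it demonstrates is that the acceptance target in speculative sampling is the ensemble of the proposer and verifier distributions, so accepted tokens follow the ensemble distribution rather than the verifier's alone.

```python
import torch

def ensemble(p, q):
    # Ensemble of two next-token distributions. The paper's combination
    # rule may differ; simple averaging is assumed here for illustration.
    return 0.5 * (p + q)

@torch.no_grad()
def speculative_ensemble_step(proposer, verifier, input_ids, gamma=4):
    """One speculate-then-verify step (illustrative sketch, batch size 1).

    The proposer drafts `gamma` tokens sequentially; the verifier scores
    the whole draft in a single parallel forward pass. The verification
    target is the ensemble of BOTH models' distributions.
    """
    seq = input_ids
    drafts, q_dists = [], []
    for _ in range(gamma):  # sequential drafting with the small model
        q = torch.softmax(proposer(seq).logits[:, -1, :], dim=-1)
        tok = torch.multinomial(q, num_samples=1)
        drafts.append(tok)
        q_dists.append(q)
        seq = torch.cat([seq, tok], dim=-1)

    # One parallel pass of the verifier over the drafted suffix.
    v_logits = verifier(seq).logits
    n0 = input_ids.shape[1]
    out = input_ids
    for i, (tok, q) in enumerate(zip(drafts, q_dists)):
        # Logits at position n0 + i - 1 predict the token at n0 + i.
        p = torch.softmax(v_logits[:, n0 + i - 1, :], dim=-1)
        e = ensemble(p, q)  # verification target = ensemble distribution
        x = tok.item()
        # Standard speculative-sampling acceptance test against `e`.
        if torch.rand(1).item() < min(1.0, (e[0, x] / q[0, x]).item()):
            out = torch.cat([out, tok], dim=-1)  # accept draft token
        else:
            # Reject: resample from the normalized residual max(0, e - q),
            # which keeps the output distributed exactly as the ensemble.
            resid = torch.clamp(e - q, min=0.0)
            resid = resid / resid.sum(dim=-1, keepdim=True)
            out = torch.cat([out, torch.multinomial(resid, 1)], dim=-1)
            break
    return out
```

The abstract's second insight, alternating which model plays proposer and which plays verifier, would wrap a loop around this step and swap the two roles between calls.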
