
CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference

2025-11-16

Dong Liu, Yanxuan Yu, Ben Lengerich


Abstract

Large language models face significant computational bottlenecks during inference due to the expensive output-layer computation over large vocabularies. We present CSV-Decode, a novel approach that uses geometric upper bounds to construct a small sub-vocabulary for each decoding step, enabling efficient sparse computation while maintaining dual correctness guarantees: exact top-k certification and ε-certified softmax approximations. Our method clusters vocabulary embeddings offline and uses centroid-plus-radius bounds to identify which tokens can be safely omitted from computation. We provide a complete system implementation with sparse GEMV kernels, multi-GPU sharding, and CUDA Graph optimization. Experimental results demonstrate significant speedup over full-vocabulary decoding while maintaining distributional guarantees and low fallback rates. Our code is available at https://github.com/FastLM/CSV-Decode.
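The abstract's centroid-plus-radius idea can be illustrated with a minimal NumPy sketch: cluster the output-embedding rows offline, then at each decoding step skip whole clusters whose Cauchy-Schwarz upper bound h·c + ||h||·r cannot reach the current top-k. The function names (build_clusters, certified_topk), the use of plain k-means, and all parameters below are illustrative assumptions; the paper's sparse GEMV kernels, ε-certified softmax approximation, multi-GPU sharding, and fallback handling are not reproduced.

```python
# Minimal NumPy sketch of centroid-plus-radius pruning for certified top-k
# decoding. Illustrative only: ignores floating-point rounding and omits the
# paper's sparse kernels, epsilon-certified softmax, and fallback path.
import numpy as np


def build_clusters(W, num_clusters=128, iters=10, seed=0):
    """Offline: k-means over the output-embedding rows; store each cluster's
    centroid and radius (max distance from the centroid to a member row)."""
    rng = np.random.default_rng(seed)
    centroids = W[rng.choice(len(W), num_clusters, replace=False)].copy()
    for _ in range(iters):
        # argmin of -2*w.c + ||c||^2 equals argmin of the squared distance
        d2 = -2.0 * (W @ centroids.T) + (centroids ** 2).sum(axis=1)
        assign = d2.argmin(axis=1)
        for c in range(num_clusters):
            members = W[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    radii = np.zeros(num_clusters, dtype=W.dtype)
    for c in range(num_clusters):
        members = W[assign == c]
        if len(members):
            radii[c] = np.linalg.norm(members - centroids[c], axis=1).max()
    return assign, centroids, radii


def certified_topk(h, W, assign, centroids, radii, k=5):
    """Per decoding step: every logit in cluster c is bounded above by
    h.centroid_c + ||h|| * radius_c (Cauchy-Schwarz). Visit clusters in
    descending bound order and stop once the next bound cannot beat the
    current k-th best exact logit, so the returned top-k is exact."""
    bounds = centroids @ h + np.linalg.norm(h) * radii
    best = []  # exact (logit, token_id) pairs seen so far
    for c in np.argsort(-bounds):
        if len(best) >= k and bounds[c] <= best[-1][0]:
            break  # remaining clusters are certified to miss the top-k
        tokens = np.flatnonzero(assign == c)
        best.extend(zip(W[tokens] @ h, tokens))
        best = sorted(best, reverse=True)[:k]
    return best


if __name__ == "__main__":
    V, d = 8000, 64
    W = np.random.randn(V, d).astype(np.float32)   # stand-in output matrix
    h = np.random.randn(d).astype(np.float32)      # stand-in hidden state
    assign, centroids, radii = build_clusters(W)
    pruned = certified_topk(h, W, assign, centroids, radii, k=5)
    exact = np.argsort(-(W @ h))[:5]
    assert {int(t) for _, t in pruned} == {int(t) for t in exact}
```

Because clusters are visited in descending bound order, the loop can terminate as soon as the next bound falls below the k-th best exact logit, which is what certifies exactness in this sketch; a real system would also need to account for floating-point rounding in the bound, which is where the paper's fallback path comes in.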
