Glinthawk: A Two-Tiered Architecture for Offline LLM Inference

2025-01-20Code Available1· sign in to hype

Pouya Hamadanian, Sadjad Fouladi

Code Available — Be the first to reproduce this paper.

Code

github.com/microsoft/glinthawk
OfficialIn papernone★ 19

Abstract

We introduce Glinthawk, an architecture for offline Large Language Model (LLM) inference. By leveraging a two-tiered structure, Glinthawk optimizes the utilization of the high-end accelerators ("Tier 1") by offloading the attention mechanism to lower-end compute tier ("Tier 2"). This separation allows the memory demand of the attention, known as the key-value cache, to scale independently from the model weights, enabling larger batch sizes and more efficient accelerator usage. Prototyped with NVIDIA T4 GPUs and standard CPU VMs, Glinthawk improves throughput by 5.9 and reduces cost of generation by 2.8, compared to paged attention baselines. For long sequence lengths, it achieves 16.3 throughput improvement at 2.4 less cost. Our evaluation shows that this architecture can tolerate moderate network latency with minimal performance degradation, making it highly effective for latency-tolerant, throughput-focused applications such as batch processing. The prototype is publicly available at https://github.com/microsoft/glinthawk.

Tasks

CPU Language Modeling Language Modelling Large Language Model

Glinthawk: A Two-Tiered Architecture for Offline LLM Inference

Code

Abstract

Tasks

Reproductions