Inference economics of language models
2025-06-05
Ege Erdil
- Code: github.com/ege-erdil/inference-economics
Abstract
We develop a theoretical model that addresses the economic trade-off between cost per token and serial token generation speed when deploying LLMs for inference at scale. Our model accounts for arithmetic, memory bandwidth, network bandwidth, and latency constraints, and optimizes over different parallelism setups and batch sizes to find those that maximize serial inference speed at a given cost per token. We use the model to compute Pareto frontiers of serial speed versus cost per token for popular language models.
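To make the optimization the abstract describes concrete, here is a minimal sketch, not the paper's actual model: it sweeps batch sizes and tensor-parallel degrees, estimates per-token latency from crude roofline-style compute, memory-bandwidth, and network-latency terms, and keeps the Pareto-optimal (speed, cost) points. Every class, function, and constant below is an illustrative assumption; it omits KV-cache traffic, attention FLOPs, network bandwidth limits, and pipeline or expert parallelism, all of which the full model treats.

```python
from dataclasses import dataclass
import itertools


@dataclass
class GPU:
    flops: float          # peak FLOP/s
    mem_bw: float         # HBM bandwidth, bytes/s
    net_latency: float    # per-hop interconnect latency, s
    cost_per_s: float     # rental price, $/s


@dataclass
class Model:
    params: float         # parameter count
    bytes_per_param: int  # e.g. 2 for fp16 weights


def token_latency(gpu: GPU, model: Model, batch: int, tp: int) -> float:
    """Per-token decode latency under a crude roofline model."""
    flops_per_step = 2 * model.params * batch / tp    # matmul FLOPs per GPU
    weight_bytes = model.params * model.bytes_per_param / tp
    t_compute = flops_per_step / gpu.flops            # arithmetic bound
    t_memory = weight_bytes / gpu.mem_bw              # weight-streaming bound
    t_network = (tp - 1) * gpu.net_latency            # all-reduce hop latency
    return max(t_compute, t_memory) + t_network


def pareto_frontier(gpu, model, batches, tps):
    points = []
    for batch, tp in itertools.product(batches, tps):
        lat = token_latency(gpu, model, batch, tp)
        speed = 1.0 / lat                             # tokens/s per request
        cost = gpu.cost_per_s * tp * lat / batch      # $/token, amortized
        points.append((speed, cost, batch, tp))

    # drop any point that another point beats on both speed and cost
    def dominated(p):
        return any(q[0] >= p[0] and q[1] <= p[1] and q[:2] != p[:2]
                   for q in points)

    return sorted(p for p in points if not dominated(p))


if __name__ == "__main__":
    gpu = GPU(flops=1e15, mem_bw=3.35e12, net_latency=5e-6,
              cost_per_s=3.0 / 3600)                  # roughly H100-like
    model = Model(params=70e9, bytes_per_param=2)     # 70B fp16 weights
    for speed, cost, batch, tp in pareto_frontier(
            gpu, model, batches=[1, 8, 32, 128], tps=[1, 2, 4, 8]):
        print(f"batch={batch:4d} tp={tp} {speed:9.1f} tok/s ${cost:.2e}/tok")
```

Even this toy version exhibits the core trade-off: larger batches amortize weight streaming and cut cost per token, while higher tensor-parallel degrees raise serial speed at the price of extra hardware and communication overhead, so the surviving points trace a speed-cost frontier.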