Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov
Code
- github.com/kuleshov-group/bd3lms (official, in paper; PyTorch; ★ 979)
- github.com/MindSpore-scientific/code-12/tree/main/Block_Model (MindSpore; ★ 0)
Abstract
Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules that minimize this variance. Block diffusion achieves state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, model weights, and a blog post on the project page: https://m-arriola.com/bd3lms
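The abstract describes the core decoding loop: generate autoregressively over blocks while denoising all tokens within a block in parallel, reusing a KV cache over the already-generated prefix. The sketch below illustrates that loop under stated assumptions; the `denoiser` stub, `MASK_ID`, and the linear unmasking schedule are hypothetical stand-ins for exposition, not the authors' implementation (see the official repo above for that).

```python
import torch

MASK_ID = 0        # hypothetical [MASK] token id for the discrete diffusion
VOCAB_SIZE = 50_257
BLOCK_SIZE = 16    # tokens denoised in parallel within one block
NUM_BLOCKS = 4     # sequence grows block by block (flexible-length generation)
T_STEPS = 8        # denoising steps per block

def denoiser(x, cache=None):
    """Stand-in for the transformer denoiser. A real implementation would
    attend to the KV cache of completed blocks so that only the current
    noisy block is re-encoded at each denoising step."""
    return torch.randn(x.shape[0], x.shape[1], VOCAB_SIZE), cache

@torch.no_grad()
def sample(batch=1):
    seq = torch.empty(batch, 0, dtype=torch.long)  # clean prefix so far
    cache = None
    for _ in range(NUM_BLOCKS):
        # Each block starts fully masked, conditioned on the clean prefix.
        block = torch.full((batch, BLOCK_SIZE), MASK_ID, dtype=torch.long)
        masked = torch.ones(batch, BLOCK_SIZE, dtype=torch.bool)
        for t in range(T_STEPS):
            logits, cache = denoiser(torch.cat([seq, block], dim=1), cache)
            probs = logits[:, -BLOCK_SIZE:].softmax(dim=-1)
            proposal = torch.distributions.Categorical(probs).sample()
            # Unmask a growing fraction of still-masked positions each step
            # (parallel token sampling); all are revealed by the final step.
            reveal = masked & (torch.rand(batch, BLOCK_SIZE) < (t + 1) / T_STEPS)
            block = torch.where(reveal, proposal, block)
            masked = masked & ~reveal
        seq = torch.cat([seq, block], dim=1)  # commit the clean block
    return seq

print(sample().shape)  # torch.Size([1, 64])
```

With a trained denoiser in place of the stub, setting `BLOCK_SIZE = 1` recovers autoregressive decoding, while a single block spanning the whole sequence recovers standard masked diffusion, which is the interpolation the title refers to.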
Tasks
- Language Modeling
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| OpenWebText | BD3-LMs | eval_perplexity | 20.73 | — | Unverified |
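For reference, `eval_perplexity` is the exponential of the mean per-token negative log-likelihood over the evaluation set; for diffusion models this likelihood is typically derived from a variational bound, so reported numbers such as the 20.73 above are upper bounds on the true perplexity. A minimal sketch of the conversion (the `token_nlls` input is a hypothetical placeholder for per-token losses from an eval run):

```python
import math

def perplexity(token_nlls):
    """token_nlls: per-token negative log-likelihoods (nats) over the eval set."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# e.g. a mean NLL of ~3.03 nats/token corresponds to a perplexity of ~20.7
print(round(perplexity([3.0, 3.1, 3.0]), 2))
```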