Vector Quantized Diffusion Model for Text-to-Image Synthesis
Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo
Code
- github.com/cientgu/vq-diffusion (Official, In paper, PyTorch, ★ 486)
- github.com/microsoft/vq-diffusion (In paper, PyTorch, ★ 978)
Abstract
We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well-suited for text-to-image generation tasks because it not only eliminates the unidirectional bias of existing methods but also allows us to incorporate a mask-and-replace diffusion strategy to avoid the accumulation of errors, which is a serious problem with existing methods. Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results when compared with conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, our VQ-Diffusion can handle more complex scenes and improve the synthesized image quality by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and hence is quite time-consuming even for normal-size images. The VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality.
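The abstract names a mask-and-replace diffusion strategy but does not spell it out. As a rough illustration only, the sketch below shows one way a single forward corruption step over discrete VQ-VAE token indices could look: each unmasked token is either turned into a special [MASK] token, resampled uniformly from the codebook, or left alone, and masked tokens stay masked. The function name `mask_and_replace_step`, the parameters `gamma` and `beta`, and the schedule values are assumptions made for this example, not the authors' implementation; the linked repositories are the reference for the actual method.

```python
import random


def mask_and_replace_step(tokens, num_codes, gamma, beta, rng):
    """Apply one illustrative mask-and-replace corruption step.

    tokens:    list of ints in [0, num_codes]; the value num_codes is [MASK].
    gamma:     assumed probability that an unmasked token becomes [MASK].
    beta:      assumed per-code resampling probability, so an unmasked token
               is redrawn uniformly with total probability num_codes * beta.
    rng:       a random.Random instance, passed in for reproducibility.
    """
    mask_id = num_codes
    replace_prob = num_codes * beta
    if gamma < 0 or beta < 0 or gamma + replace_prob > 1:
        raise ValueError("need gamma >= 0, beta >= 0, gamma + num_codes * beta <= 1")

    out = []
    for tok in tokens:
        if tok == mask_id:
            # [MASK] is absorbing in this sketch: once masked, always masked.
            out.append(mask_id)
            continue
        u = rng.random()
        if u < gamma:
            out.append(mask_id)                   # mask the token
        elif u < gamma + replace_prob:
            out.append(rng.randrange(num_codes))  # replace with a random code
        else:
            out.append(tok)                       # keep the token unchanged
    return out


# Hypothetical usage: corrupt a short token sequence for a few steps.
rng = random.Random(0)
tokens = [3, 1, 4, 1, 5, 2, 6, 5]
for _ in range(3):
    tokens = mask_and_replace_step(tokens, num_codes=8, gamma=0.2, beta=0.02, rng=rng)
```

In a sketch like this, the [MASK] token tells a denoising network which positions carry no information, while random replacement forces it to also reconsider unmasked tokens; that combination is one plausible reading of how the strategy limits error accumulation.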
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| COCO (Common Objects in Context) | VQ-Diffusion-F | FID | 13.86 | — | Unverified |
| COCO (Common Objects in Context) | VQ-Diffusion-B | FID | 19.75 | — | Unverified |
| CUB | VQ-Diffusion-S | FID | 12.97 | — | Unverified |
| CUB | VQ-Diffusion-F | FID | 10.32 | — | Unverified |
| CUB | VQ-Diffusion-B | FID | 11.94 | — | Unverified |
| Oxford 102 Flowers | VQ-Diffusion-F | FID | 14.1 | — | Unverified |
| Oxford 102 Flowers | VQ-Diffusion-B | FID | 14.88 | — | Unverified |
| Oxford 102 Flowers | VQ-Diffusion-S | FID | 14.95 | — | Unverified |