Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

2023-10-09

Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang

Code

github.com/jy0205/Pyramid-Flow (PyTorch, ★ 3,173)
github.com/bornfly-detachment/asymmetric_magvitv2 (PyTorch, ★ 151)
github.com/lucidrains/magvit2-pytorch (PyTorch, ★ 0)

Abstract

While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VVC) according to human evaluations, and (2) learning effective representations for action recognition tasks.
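
As a rough illustration of what "maps pixel-space inputs to discrete tokens" can mean, the sketch below quantizes raw image patches against a fixed codebook by nearest-neighbour lookup. It is a generic vector-quantization toy, not the MAGVIT-v2 tokenizer: the function names `tokenize_patches` and `detokenize`, the patch size, and the random codebook are all assumptions made for this example, and a real tokenizer would quantize features from a learned encoder rather than raw pixels.

```python
import numpy as np

def tokenize_patches(image, codebook, patch=4):
    """Map an (H, W, C) image to a grid of integer token ids.

    Illustrative nearest-neighbour vector quantization only (hypothetical
    helper, not the MAGVIT-v2 method).
    """
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    # Split into non-overlapping patches and flatten each into one vector.
    patches = (image[:gh * patch, :gw * patch]
               .reshape(gh, patch, gw, patch, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(gh * gw, patch * patch * c))
    # Squared Euclidean distance from every patch to every codebook entry.
    dist = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # The token id is the index of the closest codebook entry.
    return dist.argmin(axis=1).reshape(gh, gw)

def detokenize(tokens, codebook, patch=4, channels=3):
    """Rebuild an image by pasting the codebook vector for each token id."""
    gh, gw = tokens.shape
    vecs = codebook[tokens.reshape(-1)]
    return (vecs.reshape(gh, gw, patch, patch, channels)
            .transpose(0, 2, 1, 3, 4)
            .reshape(gh * patch, gw * patch, channels))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(16, 4 * 4 * 3))  # 16-entry toy vocabulary
    image = rng.normal(size=(8, 12, 3))
    tokens = tokenize_patches(image, codebook)
    print(tokens.shape)  # one token id per 4x4 patch
```

The resulting integer grid can be flattened into a sequence, which is the form a language model consumes; the vocabulary size here is simply the number of codebook rows.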

Tasks

Action Recognition, Image Generation, Language Modeling, Video Compression, Video Generation, Video Prediction

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
ImageNet 256x256	MAGVIT-v2	FID	1.78	—	Unverified
ImageNet 256x256	MAGVIT-v2 (w/o guidance)	FID	3.65	—	Unverified
ImageNet 512x512	MAGVIT-v2	FID	1.91	—	Unverified
ImageNet 512x512	MAGVIT-v2 (w/o guidance)	FID	3.07	—	Unverified
