PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

2024-03-07

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li

Code

github.com/PixArt-alpha/PixArt-sigma
Official · PyTorch · ★ 1,909

Abstract

In this paper, we introduce PixArt-Σ, a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the 'weaker' baseline to a 'stronger' model by incorporating higher-quality data, a process we term "weak-to-strong training". The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence with a significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.
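
The key/value compression idea in point (2) can be illustrated with a minimal sketch. The abstract does not say which compression operator is used, so the code below assumes plain average pooling over non-overlapping patches of the token grid; the function names (`compress_tokens`, `kv_compressed_attention`) and the `ratio` parameter are hypothetical and are not taken from the official repository. Queries stay at full resolution while keys and values shrink by `ratio * ratio`, so the attention score matrix shrinks by the same factor.

```python
import math


def compress_tokens(tokens, height, width, ratio=2):
    """Average-pool a row-major height*width grid of token vectors.

    Assumption: pooling over ratio x ratio patches stands in for the
    compression operator, which the abstract does not specify.
    """
    assert len(tokens) == height * width
    assert height % ratio == 0 and width % ratio == 0
    dim = len(tokens[0])
    pooled = []
    for top in range(0, height, ratio):
        for left in range(0, width, ratio):
            acc = [0.0] * dim
            for dy in range(ratio):
                for dx in range(ratio):
                    tok = tokens[(top + dy) * width + (left + dx)]
                    for i in range(dim):
                        acc[i] += tok[i]
            pooled.append([a / (ratio * ratio) for a in acc])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kv_compressed_attention(queries, keys, values, height, width, ratio=2):
    """Single-head attention with full-resolution queries and pooled keys/values.

    Returns the per-query outputs and the number of compressed key/value tokens.
    """
    small_k = compress_tokens(keys, height, width, ratio)
    small_v = compress_tokens(values, height, width, ratio)
    scale = 1.0 / math.sqrt(len(queries[0]))
    out_dim = len(small_v[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in small_k]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, small_v)) for i in range(out_dim)]
        )
    return outputs, len(small_k)


if __name__ == "__main__":
    # A 4x4 grid of 2-dimensional tokens: 16 queries attend to 4 pooled keys.
    grid = [[float(i), 1.0] for i in range(16)]
    out, n_kv = kv_compressed_attention(grid, grid, grid, 4, 4, ratio=2)
    print(len(out), "queries attended to", n_kv, "compressed key/value tokens")
```

With `ratio=2`, each query compares against a quarter as many keys, which is where the efficiency gain at high resolutions would come from under these assumptions.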

Tasks

4K, Image Captioning, Image Generation, Text-to-Image Generation

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
TextAtlasEval	PixArt-Sigma	TextVisionBlend OCR (F1 Score)	1.57	-	Unverified
