SOTAVerified

CascadeV: An Implementation of Wurstchen Architecture for Video Generation

2025-01-28Code Available1· sign in to hype

Wenfeng Lin, Jiangchuan Wei, Boyuan Liu, Yichen Zhang, Shiyue Yan, Mingyu Guo

Code Available — Be the first to reproduce this paper.

Reproduce

Code

Abstract

Recently, with the tremendous success of diffusion models in the field of text-to-image (T2I) generation, increasing attention has been directed toward their potential in text-to-video (T2V) applications. However, the computational demands of diffusion models pose significant challenges, particularly in generating high-resolution videos with high frame rates. In this paper, we propose CascadeV, a cascaded latent diffusion model (LDM), that is capable of producing state-of-the-art 2K resolution videos. Experiments demonstrate that our cascaded model achieves a higher compression ratio, substantially reducing the computational challenges associated with high-quality video generation. We also implement a spatiotemporal alternating grid 3D attention mechanism, which effectively integrates spatial and temporal information, ensuring superior consistency across the generated video frames. Furthermore, our model can be cascaded with existing T2V models, theoretically enabling a 4 increase in resolution or frames per second without any fine-tuning. Our code is available at https://github.com/bytedance/CascadeV.

Tasks

Reproductions