Latent Video Diffusion Models for High-Fidelity Long Video Generation

2022-11-23

Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen


Code

github.com/yingqinghe/lvdm
Official · In paper · PyTorch · ★ 504

Abstract

AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation issue for long video generation, we propose conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length. Extensive experiments on small domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.
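The abstract names two techniques for limiting accumulated error, conditional latent perturbation and unconditional guidance, without giving formulas. The sketch below shows one plausible reading under stated assumptions: that the guidance takes the familiar classifier-free form, and that the perturbation noises the conditioning latents with a standard forward-diffusion step. The function names, the `scale` and `alpha_bar` parameters, and the use of plain Python lists in place of tensors are all hypothetical and are not taken from the paper or its code.

```python
import math


def guided_noise(eps_cond, eps_uncond, scale):
    """Blend conditional and unconditional noise predictions elementwise.

    Assumed form: eps = eps_uncond + scale * (eps_cond - eps_uncond).
    scale = 1 returns the conditional prediction, scale = 0 the unconditional one.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def perturb_condition(latents, noise, alpha_bar):
    """Noise the conditioning latents before they are fed back as context.

    Assumed form: z' = sqrt(alpha_bar) * z + sqrt(1 - alpha_bar) * n,
    with alpha_bar in [0, 1]; alpha_bar = 1 leaves the latents unchanged.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    if len(latents) != len(noise):
        raise ValueError("latents and noise must have the same length")
    keep = math.sqrt(alpha_bar)
    mix = math.sqrt(1.0 - alpha_bar)
    return [keep * z + mix * n for z, n in zip(latents, noise)]
```

In this reading, lightly noising the conditioning frames during training would expose the model to imperfect context of the kind it sees when extending its own outputs, while the guidance scale controls how strongly each new segment follows that context.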

Tasks

Denoising, Image Generation, Text-to-Video Generation, Video Generation, Vocal Bursts Intensity Prediction

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
Sky Time-lapse	MoCoGAN-HD (128x128)	FVD16	183.6	—	Unverified
Sky Time-lapse	TATS (128x128)	FVD16	132.6	—	Unverified
Sky Time-lapse	Long-video GAN (128x128)	FVD16	107.5	—	Unverified
Sky Time-lapse	Long-video GAN (256x256)	FVD16	116.5	—	Unverified
Sky Time-lapse	LVDM (256x256)	FVD16	95.2	—	Unverified
Sky Time-lapse	DIGAN (128x128)	FVD16	114.6	—	Unverified
Taichi	DIGAN (128x128)	FVD16	128.1	—	Unverified
Taichi	DIGAN (256x256)	FVD16	156.7	—	Unverified
Taichi	TATS (128x128)	FVD16	94.6	—	Unverified
Taichi	LVDM (256x256)	FVD16	99	—	Unverified
Taichi	MoCoGAN-HD (128x128)	FVD16	144.7	—	Unverified
UCF-101	LVDM (256x256, unconditional)	FVD16	372	—	Unverified
UCF-101	MCVD	FVD16	2,460	—	Unverified
UCF-101	VDM	FVD16	1,396	—	Unverified
UCF-101	TGAN-v2 (128x128)	FVD16	1,209	—	Unverified
UCF-101	LVDM (256x256, unconditional)	FVD16	552	—	Unverified
