ModelScope Text-to-Video Technical Report
Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang
Code
- github.com/exponentialml/text-to-video-finetuning (official, referenced in paper; PyTorch, ★ 697)
- github.com/ali-vilab/VGen (PyTorch, ★ 3,152)
- github.com/picsart-ai-research/streamingt2v (PyTorch, ★ 1,629)
- github.com/yhZhai/mcm (PyTorch, ★ 70)
- github.com/ali-vilab/i2vgen-xl (PyTorch, ★ 8)
Abstract
This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model can adapt to varying frame numbers during training and inference, rendering it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), comprising 1.7 billion parameters in total, of which 0.5 billion are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.
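The spatio-temporal blocks described above pair the spatial layers inherited from the image model with added temporal layers that mix information across frames. Below is a minimal, illustrative PyTorch sketch of that idea; the module name `SpatioTemporalBlock`, the layer choices, and the tensor layout are assumptions made for clarity, not the official ModelScopeT2V implementation (see the repositories linked above for that).

```python
# Illustrative sketch only: a simplified spatio-temporal block in the spirit of the
# paper's description (spatial layers from the image model plus added temporal layers).
# Names, shapes, and layer choices are assumptions, not the official code.
import torch
import torch.nn as nn


class SpatioTemporalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Spatial convolution, applied independently to every frame.
        self.spatial_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Temporal convolution, applied along the frame axis at each spatial position.
        self.temporal_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * num_frames, channels, height, width)
        bf, c, h, w = x.shape
        b = bf // num_frames
        x = x + self.spatial_conv(x)  # residual spatial update per frame
        # Rearrange so the 1D convolution runs over the frame dimension.
        xt = x.view(b, num_frames, c, h, w).permute(0, 3, 4, 2, 1)  # (b, h, w, c, f)
        xt = xt.reshape(b * h * w, c, num_frames)
        xt = xt + self.temporal_conv(xt)  # residual temporal update across frames
        xt = xt.reshape(b, h, w, c, num_frames).permute(0, 4, 3, 1, 2)
        return xt.reshape(bf, c, h, w)


# Example: 2 latent videos of 8 frames, 64 channels, 32x32 spatial resolution.
block = SpatioTemporalBlock(channels=64)
out = block(torch.randn(2 * 8, 64, 32, 32), num_frames=8)
print(out.shape)  # torch.Size([16, 64, 32, 32])
```

Passing `num_frames=1` degenerates to per-image processing, which mirrors the abstract's claim that the model can adapt to varying frame numbers and train on image-text as well as video-text data.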
Tasks
- Text-to-Video Generation
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| MSR-VTT | ModelScopeT2V | FVD ↓ | 550 | — | Unverified |
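FVD (Fréchet Video Distance) compares the distribution of features extracted from generated clips with that of real clips; lower is better. The sketch below shows the Fréchet distance between two Gaussians fitted to pre-extracted per-clip features, which is the computation underlying FVD. The feature extractor (commonly an I3D network), the function name, and the toy data are assumptions; this is not the evaluation script used to produce the number above.

```python
# Sketch of the Fréchet distance that underlies FVD, assuming per-clip video
# features (e.g. from an I3D network) have already been extracted for real and
# generated videos. Names and toy data are illustrative, not an official script.
import numpy as np
from scipy import linalg


def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Both inputs have shape (num_clips, feature_dim); lower output is better."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the covariance product; drop tiny imaginary parts.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))


# Toy usage: 256 clips per side with 16-dimensional features.
rng = np.random.default_rng(0)
real = rng.normal(size=(256, 16))
gen = rng.normal(loc=0.5, size=(256, 16))
print(frechet_distance(real, gen))
```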