ModelScope Text-to-Video Technical Report
Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang
Code
- github.com/exponentialml/text-to-video-finetuning (official, referenced in paper; PyTorch, ★ 697)
- github.com/ali-vilab/VGen (PyTorch, ★ 3,152)
- github.com/picsart-ai-research/streamingt2v (PyTorch, ★ 1,629)
- github.com/yhZhai/mcm (PyTorch, ★ 70)
- github.com/ali-vilab/i2vgen-xl (PyTorch, ★ 8)
Abstract
This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model can adapt to varying frame numbers during training and inference, rendering it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), comprising 1.7 billion parameters in total, of which 0.5 billion are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.
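The spatio-temporal blocks described above pair the spatial layers inherited from the image model with added temporal layers that mix information across frames. Below is a minimal, illustrative PyTorch sketch of that idea; the module name `SpatioTemporalBlock`, the layer choices, and the tensor layout are assumptions made for clarity, not the official ModelScopeT2V implementation (see the repositories linked above for that).

```python
# Illustrative sketch only: a simplified spatio-temporal block in the spirit of the
# paper's description (spatial layers from the image model plus added temporal layers).
# Names, shapes, and layer choices are assumptions, not the official code.
import torch
import torch.nn as nn


class SpatioTemporalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Spatial convolution, applied independently to every frame.
        self.spatial_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Temporal convolution, applied along the frame axis at each spatial position.
        self.temporal_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * num_frames, channels, height, width)
        bf, c, h, w = x.shape
        b = bf // num_frames
        x = x + self.spatial_conv(x)  # residual spatial update per frame
        # Rearrange so the 1D convolution runs over the frame dimension.
        xt = x.view(b, num_frames, c, h, w).permute(0, 3, 4, 2, 1)  # (b, h, w, c, f)
        xt = xt.reshape(b * h * w, c, num_frames)
        xt = xt + self.temporal_conv(xt)  # residual temporal update across frames
        xt = xt.reshape(b, h, w, c, num_frames).permute(0, 4, 3, 1, 2)
        return xt.reshape(bf, c, h, w)


# Example: 2 latent videos of 8 frames, 64 channels, 32x32 spatial resolution.
block = SpatioTemporalBlock(channels=64)
out = block(torch.randn(2 * 8, 64, 32, 32), num_frames=8)
print(out.shape)  # torch.Size([16, 64, 32, 32])
```

Passing `num_frames=1` degenerates to per-image processing, which mirrors the abstract's claim that the model can adapt to varying frame numbers and train on image-text as well as video-text data.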
Tasks
- Text-to-Video Generation
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| MSR-VTT | ModelScopeT2V | FVD ↓ | 550 | — | Unverified |
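FVD (Fréchet Video Distance) compares the distribution of features extracted from generated clips with that of real clips; lower is better. The sketch below shows the Fréchet distance between two Gaussians fitted to pre-extracted per-clip features, which is the computation underlying FVD. The feature extractor (commonly an I3D network), the function name, and the toy data are assumptions; this is not the evaluation script used to produce the number above.

```python
# Sketch of the Fréchet distance that underlies FVD, assuming per-clip video
# features (e.g. from an I3D network) have already been extracted for real and
# generated videos. Names and toy data are illustrative, not an official script.
import numpy as np
from scipy import linalg


def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Both inputs have shape (num_clips, feature_dim); lower output is better."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the covariance product; drop tiny imaginary parts.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))


# Toy usage: 256 clips per side with 16-dimensional features.
rng = np.random.default_rng(0)
real = rng.normal(size=(256, 16))
gen = rng.normal(loc=0.5, size=(256, 16))
print(frechet_distance(real, gen))
```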