Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, YuChao Gu, Yufei Shi, Wynne Hsu, Ying Shan, XiaoHu Qie, Mike Zheng Shou
Code
- github.com/showlab/Tune-A-Video (official, PyTorch, ★ 4,379)
- github.com/xuduo35/MakeLongVideo (PyTorch, ★ 82)
- github.com/showlab/loveu-tgve-2023 (PyTorch, ★ 78)
Abstract
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, this paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, in which only a single text-video pair is provided. Our model builds on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which combines a tailored spatio-temporal attention mechanism with an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and quantitative experiments demonstrate the remarkable ability of our method across various applications.
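To make the "spatio-temporal attention" idea concrete, the sketch below shows one plausible sparse, causal pattern: each frame's queries attend to keys and values gathered from the first frame and the immediately preceding frame, which couples all frames to a shared anchor at roughly twice the cost of per-frame spatial attention. This is an illustrative NumPy toy, not the paper's exact layer; the function name, shapes, and the specific key/value selection are our assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def spatio_temporal_attention(q, k, v):
    """Sparse-causal cross-frame attention (illustrative sketch).

    q, k, v: arrays of shape (T, N, d) -- T frames, N spatial tokens
    per frame, d channels. For frame i, queries attend to keys/values
    from frame 0 (a fixed anchor) and frame i-1 (the previous frame),
    which encourages content consistency across frames.
    """
    T, N, d = q.shape
    out = np.empty_like(q)
    for i in range(T):
        prev = max(i - 1, 0)
        # Concatenate anchor-frame and previous-frame tokens: (2N, d).
        k_i = np.concatenate([k[0], k[prev]], axis=0)
        v_i = np.concatenate([v[0], v[prev]], axis=0)
        # Scaled dot-product attention for frame i's queries: (N, 2N).
        attn = softmax(q[i] @ k_i.T / np.sqrt(d))
        out[i] = attn @ v_i
    return out
```

Because each attention row is a convex combination of value tokens, feeding identical values for every frame returns those values unchanged, which is a quick sanity check on the weights.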