CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang
Code
- github.com/thudm/cogvideo (official, PyTorch)
Abstract
Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. However, their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model from understanding complex movement semantics. In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.
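The multi-frame-rate idea amounts to conditioning the autoregressive transformer on a frame-rate token prepended to the text, so the same clip can be sampled at several temporal resolutions during training. The snippet below is a minimal, hedged sketch of that conditioning, not the paper's implementation: the vocabulary ids, the `tokenize_frame` helper, and the `model` interface are hypothetical placeholders, and the hierarchical interpolation stage is omitted.

```python
# Minimal sketch (not CogVideo's actual code) of multi-frame-rate conditioning:
# a frame-rate token is prepended to the text tokens, followed by per-frame
# image tokens, and the model is trained with the usual next-token loss.
import torch

FRAME_RATE_TOKENS = {1: 50001, 2: 50002, 4: 50003, 8: 50004}  # hypothetical vocab ids

def build_sequence(text_ids: torch.Tensor, video_frames, fps: int, tokenize_frame) -> torch.Tensor:
    """Concatenate [frame-rate token] + text tokens + per-frame image tokens."""
    rate_tok = torch.tensor([FRAME_RATE_TOKENS[fps]])
    frame_toks = torch.cat([tokenize_frame(f) for f in video_frames])  # e.g. VQ-VAE codes per frame
    return torch.cat([rate_tok, text_ids, frame_toks])

def training_step(model, seq: torch.Tensor, optimizer) -> float:
    """Standard autoregressive cross-entropy over the concatenated sequence."""
    logits = model(seq[:, :-1])                  # predict token t+1 from tokens <= t
    loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```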
Tasks
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| UCF-101 | CogVideo (128x128, class-conditional) | FVD16 (16-frame Fréchet Video Distance) | 305 | — | Unverified |
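FVD16 is the Fréchet Video Distance computed on 16-frame clips: real and generated clips are embedded with a pretrained video network (an I3D model in the standard FVD setup), and the Fréchet distance between Gaussian fits of the two feature sets is reported. The sketch below shows only that final distance computation, assuming clip features have already been extracted; it is illustrative and not the exact evaluation pipeline used for the number above.

```python
# Sketch of the Fréchet distance between feature statistics of real vs. generated clips.
# Feature extraction (e.g. with an I3D network) is assumed to have been done already.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """feats_*: (N, D) arrays of clip-level features from the video embedder."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):          # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```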