LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models

2025-03-18

Yu Cheng, Fajie Yuan


Abstract

Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's strength in video reconstruction and generation, particularly its efficiency advantage over existing Video VAEs. Our model offers up to 50x fewer FLOPs and 44x faster inference while maintaining competitive reconstruction quality, providing insights for scalable, efficient video generation. Our models and code are available at https://github.com/westlake-repl/LeanVAE
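The abstract names two ingredients: non-overlapping patch operations (so no convolutional overlap between tokens) and a wavelet-transform front end. The paper's actual architecture is in the linked repository; the sketch below is only a hypothetical illustration of those two ideas — `patchify` and `haar_dwt2` are names invented here, and the single-level Haar transform stands in for whichever wavelet the authors use.

```python
import numpy as np

def patchify(video, p=4):
    # Split each frame into non-overlapping p x p patches and flatten
    # every patch into one token -- a sketch of the kind of patch
    # operation the abstract describes (no conv overlap between tokens).
    t, h, w, c = video.shape
    patches = video.reshape(t, h // p, p, w // p, p, c)
    patches = patches.transpose(0, 1, 3, 2, 4, 5)
    return patches.reshape(t, (h // p) * (w // p), p * p * c)

def haar_dwt2(frame):
    # Single-level 2-D Haar wavelet transform of one grayscale frame,
    # returning (LL, LH, HL, HH) sub-bands: the low-frequency LL band
    # plus three detail bands, illustrating the frequency decomposition
    # a wavelet front end would feed to the encoder.
    a = (frame[0::2] + frame[1::2]) / 2.0   # average of row pairs
    d = (frame[0::2] - frame[1::2]) / 2.0   # difference of row pairs
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

video = np.random.rand(8, 64, 64, 3)          # (frames, H, W, channels)
tokens = patchify(video, p=4)
print(tokens.shape)                            # (8, 256, 48)
ll, lh, hl, hh = haar_dwt2(video[0, :, :, 0])
print(ll.shape)                                # (32, 32)
```

Each 4x4 patch of a 64x64 RGB frame becomes one 48-dimensional token, and the Haar step halves spatial resolution per sub-band; both are cheap, invertible reshaping-style operations, which is consistent with the efficiency focus of the abstract.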

Benchmark Results

Dataset          Model            Metric  Claimed  Verified  Status
Sky Time-lapse   Latte + LeanVAE  FVD16   49.59    —         Unverified
UCF-101          Latte + LeanVAE  FVD16   164.45   —         Unverified
