
Boosting Latent Diffusion Models via Disentangled Representation Alignment

2026-03-16

John Page, Xuesong Niu, Kai Wu, Kun Gai


Abstract

Latent Diffusion Models (LDMs) rely heavily on the compressed latent space provided by Variational Autoencoders (VAEs) for high-quality image generation. Recent studies have attempted to obtain generation-friendly VAEs by directly adopting alignment strategies from LDM training, leveraging Vision Foundation Models (VFMs) as representation alignment targets. However, such alignment paradigms overlook the fundamental differences in representational requirements between LDMs and VAEs: simple feature mapping from local patches to high-dimensional semantics can induce semantic collapse, losing fine-grained attributes. In this paper, we reveal a key insight: unlike LDMs, which benefit from high-level global semantics, a generation-friendly VAE must possess strong semantic disentanglement capabilities to preserve fine-grained, attribute-level information in a structured manner. To address this discrepancy, we propose the Semantic-Disentangled VAE (Send-VAE). Departing from previous shallow alignment approaches, Send-VAE introduces a non-linear mapping architecture to effectively bridge the local structures of VAEs and the dense semantics of VFMs, thereby encouraging emergent disentangled properties in the latent space without explicit regularization. Extensive experiments establish a new paradigm for evaluating VAE latent spaces via low-level attribute separability and demonstrate that Send-VAE achieves state-of-the-art generation quality (FID of 1.21) on ImageNet 256×256.
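The core mechanism described above — a non-linear projector that maps low-dimensional VAE latent patches into the VFM feature space before computing an alignment loss — can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the two-layer MLP, the cosine-similarity objective, and all shapes and names are assumptions for exposition.

```python
import numpy as np

# Hedged sketch of representation alignment between VAE latents and VFM
# features via a non-linear mapping. Shapes and the loss form are illustrative.

rng = np.random.default_rng(0)

def mlp_projector(z, w1, w2):
    """Two-layer non-linear map from the VAE latent dim to the VFM feature dim."""
    h = np.maximum(z @ w1, 0.0)  # ReLU hidden layer (the non-linear bridge)
    return h @ w2                # project up to the VFM feature dimension

def alignment_loss(pred, target):
    """1 - mean cosine similarity between projected latents and VFM features."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return 1.0 - np.mean(np.sum(pred * target, axis=-1))

# Toy shapes: 196 latent patch tokens, VAE latent dim 16, VFM feature dim 768.
z_vae = rng.normal(size=(196, 16))    # VAE encoder output (patch tokens)
f_vfm = rng.normal(size=(196, 768))   # frozen VFM targets (e.g. a DINO-style model)
w1 = rng.normal(size=(16, 256)) * 0.1
w2 = rng.normal(size=(256, 768)) * 0.1

loss = alignment_loss(mlp_projector(z_vae, w1, w2), f_vfm)
print(loss)  # scalar in [0, 2]; 0 would mean perfect alignment
```

In practice such a projector would be trained jointly with the VAE (with the VFM frozen); the claim in the abstract is that routing the alignment through a non-linear map, rather than a direct linear one, avoids collapsing local latent structure onto global semantics.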
