
E-MD3C: Taming Masked Diffusion Transformers for Efficient Zero-Shot Object Customization

2025-02-13

Trung X. Pham, Zhang Kang, Ji Woo Hong, Xuran Zheng, Chang D. Yoo


Abstract

We propose E-MD3C (Efficient Masked Diffusion Transformer with Disentangled Conditions and Compact Collector), a highly efficient framework for zero-shot object image customization. Unlike prior works that rely on resource-intensive Unet architectures, our approach employs lightweight masked diffusion transformers operating on latent patches, offering significantly improved computational efficiency. The framework integrates three core components: (1) an efficient masked diffusion transformer for processing autoencoder latents, (2) a disentangled condition design that ensures compactness while preserving background alignment and fine details, and (3) a learnable Conditions Collector that consolidates multiple inputs into a compact representation for efficient denoising and learning. E-MD3C outperforms the existing approach on the VITON-HD dataset across metrics such as PSNR, FID, SSIM, and LPIPS, demonstrating clear advantages in parameters, memory efficiency, and inference speed. With only 1/4 of the parameters, our Transformer-based 468M model delivers 2.5× faster inference and uses 2/3 of the GPU memory compared to a 1720M Unet-based latent diffusion model.
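The learnable Conditions Collector described in component (3) can be illustrated with a minimal sketch: a small set of learnable query tokens cross-attends over the concatenated condition tokens and pools them into a compact representation handed to the denoiser. All names, dimensions, and the single-head attention form below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def collect_conditions(cond_tokens, queries, Wq, Wk, Wv):
    # cond_tokens: (N, d) concatenated condition tokens (hypothetical)
    # queries:     (M, d) learnable collector tokens, with M << N
    q = queries @ Wq                                  # (M, d)
    k = cond_tokens @ Wk                              # (N, d)
    v = cond_tokens @ Wv                              # (N, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (M, N)
    return attn @ v                                   # (M, d) compact conditioning

# Illustrative sizes: many condition tokens collapsed into 16 compact tokens.
d, N, M = 64, 333, 16
cond = rng.standard_normal((N, d))
queries = rng.standard_normal((M, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.05 for _ in range(3))

compact = collect_conditions(cond, queries, Wq, Wk, Wv)
print(compact.shape)  # (16, 64)
```

The point of the design, as the abstract frames it, is that the denoiser then attends to only M compact tokens instead of every raw condition token, which is where the memory and speed savings would come from.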
