
Self-transcendence: Is External Feature Guidance Indispensable for Accelerating Diffusion Transformer Training?

2026-03-15

Lingchen Sun, Rongyuan Wu, Zhengqiang Zhang, Ruibin Li, Yujing Sun, Shuaizheng Liu, Lei Zhang

Abstract

Recent works such as REPA have shown that guiding diffusion models with external semantic features (e.g., DINO) can significantly accelerate the training of diffusion transformers (DiTs). However, using pretrained external features as guidance signals introduces additional dependencies. We argue that DiTs have the capacity to guide their own training, and propose Self-Transcendence, an effective method that achieves fast convergence using internal feature supervision only. The desired internal guidance features should meet two requirements: they should be structurally clean, to help shallow blocks separate noise from signal, and semantically discriminative, to help shallow layers learn effective representations. With these requirements in mind, we first align the DiT features with the clean VAE latent features, a native component of latent diffusion, for a short training phase (e.g., 40 epochs) to improve their structural representations; we then apply classifier-free guidance to the intermediate features, enhancing their discriminative capability and semantic expressiveness. These enriched internal features, learned entirely within the model, serve as supervision signals to guide the training of a new DiT from scratch. Compared to existing self-contained methods, our approach achieves a significant performance boost. It even surpasses REPA, which uses external DINO features as guidance, in both generation quality and convergence speed on class-to-image and text-to-image generation tasks. The source code of our method can be found at https://github.com/csslc/Self-Transcendence.
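The two internal-guidance ingredients described above can be sketched schematically. The function names, tensor shapes, and the cosine-based alignment objective below are illustrative assumptions for exposition, not the paper's exact implementation:

```python
import numpy as np

def alignment_loss(dit_feats, vae_latents, eps=1e-8):
    """Align (shallow) DiT features to clean VAE latent features.

    A cosine-similarity objective is assumed here for illustration:
    the loss is 0 when every token's feature points in the same
    direction as the corresponding clean latent.
    Shapes: (batch, tokens, dim).
    """
    a = dit_feats / (np.linalg.norm(dit_feats, axis=-1, keepdims=True) + eps)
    b = vae_latents / (np.linalg.norm(vae_latents, axis=-1, keepdims=True) + eps)
    return 1.0 - float(np.mean(np.sum(a * b, axis=-1)))

def cfg_features(cond_feats, uncond_feats, scale=4.0):
    """Classifier-free guidance applied to intermediate features.

    Extrapolates conditional features away from unconditional ones,
    sharpening their class-discriminative component; the guidance
    scale is a hypothetical value.
    """
    return uncond_feats + scale * (cond_feats - uncond_feats)
```

After the short alignment phase, the guided features produced by `cfg_features` would play the role that external DINO features play in REPA: a fixed supervision target for a fresh DiT trained from scratch.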
