SOTAVerified

SpectralKD: A Unified Framework for Interpreting and Distilling Vision Transformers via Spectral Analysis

2024-12-26 · Code Available

Huiyuan Tian, Bonan Xu, Shijian Li, Gang Pan


Abstract

Knowledge Distillation (KD) has achieved widespread success in compressing large Vision Transformers (ViTs), but a unified theoretical framework for both ViTs and KD is still lacking. In this paper, we propose SpectralKD, a novel unified analytical framework that offers deeper insights into ViTs and optimizes KD via spectral analysis. Our model-wise analysis reveals that CaiT concentrates information in its first and last few layers, informing optimal layer selection for KD. Surprisingly, our layer-wise analysis discovers that Swin Transformer and CaiT exhibit similar spectral encoding patterns despite their architectural differences, leading to a feature map alignment guideline. Building on these insights, we propose a simple yet effective spectral alignment method for KD. Benefiting from the deeper understanding provided by the above analysis, even this simple strategy achieves state-of-the-art performance on ImageNet-1K without introducing any trainable parameters, improving DeiT-Tiny by +5.2% and Swin-Tiny by +1.4% in top-1 accuracy. Furthermore, our post-training analysis reveals that distilled students reproduce spectral patterns similar to their teachers', opening a new area we term "distillation dynamics". Code and experimental logs are available at https://github.com/thy960112/SpectralKD.
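The kind of layer-wise spectral analysis the abstract describes (comparing ViT layers by the frequency content of their feature maps) can be sketched roughly as follows. This is a minimal NumPy illustration under our own assumptions about shapes and channel averaging, not the authors' implementation:

```python
import numpy as np

def layer_spectrum(feature_map):
    """Channel-averaged magnitude spectrum of one layer's feature map.

    feature_map: array of shape (C, H, W), e.g. a ViT block's output
    reshaped onto its spatial grid. Returns an (H, W) array with low
    frequencies shifted to the center. Hypothetical helper, assumed here
    for illustration only.
    """
    fft = np.fft.fft2(feature_map, axes=(-2, -1))
    shifted = np.fft.fftshift(fft, axes=(-2, -1))
    return np.abs(shifted).mean(axis=0)

# Toy usage: random stand-ins for two layers' feature maps on a
# 14x14 token grid with 8 channels.
rng = np.random.default_rng(0)
spec_a = layer_spectrum(rng.standard_normal((8, 14, 14)))
spec_b = layer_spectrum(rng.standard_normal((8, 14, 14)))

# Layers (or teacher/student pairs) could then be compared by their
# spectral energy, e.g. total magnitude or a low/high-frequency split.
print(spec_a.shape, float(np.abs(spec_a - spec_b).sum()))
```

A real pipeline would hook the transformer blocks to capture feature maps during a forward pass; the comparison metric between spectra is a design choice this sketch leaves open.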

Tasks

Benchmark Results

| Dataset  | Model                            | Metric           | Claimed | Verified | Status     |
|----------|----------------------------------|------------------|---------|----------|------------|
| ImageNet | SpectralKD (T: Swin-S, S: Swin-T)  | Top-1 accuracy % | 82.7    |          | Unverified |
| ImageNet | SpectralKD (T: CaiT-S24, S: DeiT-S) | Top-1 accuracy % | 82.2    |          | Unverified |
| ImageNet | SpectralKD (T: CaiT-S24, S: DeiT-T) | Top-1 accuracy % | 77.4    |          | Unverified |

Reproductions