V_kD: Improving Knowledge Distillation using Orthogonal Projections
Roy Miles, Ismail Elezi, Jiankang Deng
Code
- github.com/roymiles/vkd (official, in paper, PyTorch, ★ 57)
Abstract
Knowledge distillation is an effective method for training small and efficient deep learning models. However, the efficacy of a single method can degenerate when transferring to other tasks, modalities, or even other architectures. To address this limitation, we propose a novel constrained feature distillation method. This method is derived from a small set of core principles, which results in two emerging components: an orthogonal projection and a task-specific normalisation. Equipped with both of these components, our transformer models can outperform all previous methods on ImageNet and reach up to a 4.4% relative improvement over the previous state-of-the-art methods. To further demonstrate the generality of our method, we apply it to object detection and image generation, whereby we obtain consistent and substantial performance improvements over state-of-the-art. Code and models are publicly available: https://github.com/roymiles/vkd
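The two components named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation (the official PyTorch code is in the linked repository). The function names, the Gram-Schmidt construction, the use of per-feature standardisation as a stand-in for the "task-specific normalisation", and the squared-error loss are all assumptions made for illustration only.

```python
import math
import random


def orthonormal_rows(n_rows, n_cols, seed=0):
    """Hypothetical helper: build n_rows orthonormal vectors of length n_cols
    (requires n_rows <= n_cols) using Gram-Schmidt on random Gaussian vectors."""
    rng = random.Random(seed)
    basis = []
    while len(basis) < n_rows:
        v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
        # Remove the components along the vectors already in the basis.
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-8:
            basis.append([x / norm for x in v])
    return basis


def project(student_feat, P):
    """Map a student feature (length d_s) into the teacher space (length d_t)
    with a matrix P of shape d_s x d_t whose rows are orthonormal."""
    d_t = len(P[0])
    return [sum(s * P[i][j] for i, s in enumerate(student_feat)) for j in range(d_t)]


def standardise(feat, eps=1e-6):
    """Assumed placeholder for the normalisation step: zero mean, unit variance."""
    mean = sum(feat) / len(feat)
    var = sum((x - mean) ** 2 for x in feat) / len(feat)
    return [(x - mean) / math.sqrt(var + eps) for x in feat]


def distill_loss(student_feat, teacher_feat, P):
    """Mean squared error between the projected student feature and the
    normalised teacher feature (an assumed choice of distance)."""
    z = project(student_feat, P)
    t = standardise(teacher_feat)
    return sum((a - b) ** 2 for a, b in zip(z, t)) / len(t)


if __name__ == "__main__":
    P = orthonormal_rows(3, 5)          # student dim 3, teacher dim 5
    student = [1.0, -2.0, 0.5]
    teacher = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(distill_loss(student, teacher, P))
```

Because the rows of `P` are orthonormal, `project` preserves the norm of the student feature, which is the property that distinguishes it from an unconstrained linear projection in this sketch.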
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| ImageNet | VkD (T: RegNetY-160, S: DeiT-S) | Top-1 accuracy % | 82.9 | — | Unverified |
| ImageNet | VkD (T: RegNetY-160, S: DeiT-Ti) | Top-1 accuracy % | 79.2 | — | Unverified |