CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation

2026-03-27

Jeannie Chung, Hanna Jang, Ingyeong Yang, Uiwon Hwang, Jaehyeong Sim

Abstract

CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. Its large-scale architecture requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. However, existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, limiting the student's ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework that introduces two novel methods, Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions. By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry with that of the teacher, outperforming existing methods by 0.8 percentage points.
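The abstract does not give the loss formulas, so the following is only a rough sketch of one possible reading of VRD and XRD, not the authors' implementation. It assumes that "distillation strength" in a modality can be represented by the softmax distribution of student-to-teacher cosine similarities within that modality, and that "bidirectional symmetry" means matching the student-image-to-teacher-text distribution against the teacher-image-to-student-text distribution. The function names (`sim_dist`, `vrd_loss`, `xrd_loss`), the temperature `tau`, and the use of a symmetric KL divergence are all hypothetical choices made for illustration. Plain Python lists are used so the example has no dependencies.

```python
import math

def normalize(v):
    # L2-normalize one embedding (assumes v is not the zero vector).
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def softmax(row, tau):
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    m = max(row)
    exps = [math.exp((x - m) / tau) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def sim_dist(a, b, tau=0.1):
    # For each embedding in batch a, a distribution over batch b
    # built from cosine similarities.
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return [softmax([sum(x * y for x, y in zip(u, w)) for w in b], tau)
            for u in a]

def kl(p, q, eps=1e-12):
    # Mean row-wise KL(p || q); eps guards against log(0).
    total = 0.0
    for pr, qr in zip(p, q):
        total += sum(pi * math.log((pi + eps) / (qi + eps))
                     for pi, qi in zip(pr, qr))
    return total / len(p)

def sym_kl(p, q):
    return 0.5 * (kl(p, q) + kl(q, p))

def vrd_loss(s_img, s_txt, t_img, t_txt, tau=0.1):
    # ASSUMPTION: compare the student-to-teacher similarity distribution
    # in the image modality with the one in the text modality.
    p_img = sim_dist(s_img, t_img, tau)
    p_txt = sim_dist(s_txt, t_txt, tau)
    return sym_kl(p_img, p_txt)

def xrd_loss(s_img, s_txt, t_img, t_txt, tau=0.1):
    # ASSUMPTION: student-image -> teacher-text should mirror
    # teacher-image -> student-text.
    p = sim_dist(s_img, t_txt, tau)
    q = sim_dist(t_img, s_txt, tau)
    return sym_kl(p, q)
```

Under this reading, the XRD term vanishes when the student reproduces the teacher's embeddings exactly, while the VRD term vanishes when the image-side and text-side student-to-teacher distributions coincide. In practice these terms would presumably be added to a standard contrastive or feature-distillation objective with weighting coefficients, which the abstract does not specify.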
