Explaining Vision-Language Similarities in Dual Encoders with Feature-Pair Attributions

2024-08-26

Lucas Möller, Pascal Tilli, Ngoc Thang Vu, Sebastian Padó


Abstract

Dual encoder architectures like CLIP models map two types of inputs into a shared embedding space and learn similarities between them. However, it is not well understood how such models compare two inputs. Here, we address this research gap with two contributions. First, we derive a method to attribute predictions of any differentiable dual encoder onto feature-pair interactions between its inputs. Second, we apply our method to CLIP-type models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modalities and also account for mismatches. However, this visual-linguistic grounding ability varies heavily across object classes, depends on the training data distribution, and improves substantially after in-domain training. Using our method, we can identify knowledge gaps about specific object classes in individual models and monitor their improvement upon fine-tuning.
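To illustrate the idea of feature-pair attributions, consider the simplest differentiable dual-encoder score, a bilinear similarity s(a, b) = aᵀW b. Its mixed second derivative ∂²s/∂aᵢ∂bⱼ equals Wᵢⱼ, so the quantity aᵢ·Wᵢⱼ·bⱼ attributes the score onto each feature pair (i, j), and these attributions sum exactly to s. The sketch below demonstrates this decomposition; it is a minimal toy illustration of the general principle, not the paper's actual method (which handles arbitrary differentiable encoders), and the names `a`, `b`, `W` are illustrative, not the paper's notation.

```python
def similarity(a, b, W):
    """Bilinear dual-encoder similarity score s(a, b) = a^T W b."""
    return sum(a[i] * W[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))

def pair_attributions(a, b, W):
    """Attribute the score onto feature pairs (i, j).

    For a bilinear score the mixed second derivative d^2 s / da_i db_j
    is W[i][j], so a[i] * W[i][j] * b[j] decomposes s exactly.
    """
    return [[a[i] * W[i][j] * b[j] for j in range(len(b))]
            for i in range(len(a))]

# Toy embeddings for the two input modalities (illustrative values).
a = [1.0, 2.0]
b = [0.5, -1.0, 3.0]
W = [[0.2, 0.0, 1.0],
     [-0.5, 0.4, 0.1]]

A = pair_attributions(a, b, W)
total = sum(sum(row) for row in A)

# Completeness: the pairwise attributions sum to the similarity score.
assert abs(total - similarity(a, b, W)) < 1e-9
```

For general (non-bilinear) encoders such as CLIP, the analogous interaction terms involve the model's higher-order derivatives and require integration along a path between a baseline and the input; the bilinear case above is the degenerate setting where the decomposition is exact in closed form.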
