ImageBind: One Embedding Space To Bind Them All
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
Code
- github.com/facebookresearch/imagebind (official, in paper; PyTorch; ★ 8,995)
- github.com/klemens-floege/oneprot (PyTorch; ★ 21)
- github.com/ginihumer/amumo (JAX; ★ 8)
Abstract
We present ImageBind, an approach to learn a joint embedding across six modalities: images, text, audio, depth, thermal, and IMU data. We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models, extending their zero-shot capabilities to new modalities simply through those modalities' natural pairing with images. It enables novel emergent applications 'out-of-the-box', including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and generation. The emergent capabilities improve with the strength of the image encoder, and we set a new state of the art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models on visual and non-visual tasks.
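As a concrete illustration of the 'out-of-the-box' cross-modal retrieval described above, here is a minimal sketch following the quick-start usage in the official facebookresearch/imagebind repository's README. The asset file paths are placeholders, and the printed matrices are softmax-normalized similarities of each image and audio clip against the text prompts.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

text_list = ["A dog.", "A car", "A bird"]
image_paths = [".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths = [".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Instantiate the pretrained ImageBind-Huge model.
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Each modality has its own preprocessing transform, but all inputs are
# embedded into the same joint space.
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

# Cross-modal similarities: vision x text and audio x text. High values on
# the diagonal indicate correct emergent zero-shot retrieval.
print(torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1))
print(torch.softmax(embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1))
```

Because every modality lands in one shared space, the paper's "composing modalities with arithmetic" amounts to summing embeddings from different modalities (e.g., an image plus an audio clip) and retrieving with the combined vector.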
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| Vinoground | ImageBind | Text Score | 9.4 | — | Unverified |