EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

2024-02-06

Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Xinlong Wang

Code

github.com/baaivision/EVA/tree/master/EVA-CLIP-18B
Official, PyTorch

Abstract

Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both vision and multimodal models. We present EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date, with 18 billion parameters. With only 6 billion training samples seen, EVA-CLIP-18B achieves an exceptional 80.7% zero-shot top-1 accuracy averaged across 27 widely recognized image classification benchmarks, outperforming its forerunner EVA-CLIP (5 billion parameters) and other open-source CLIP models by a large margin. Remarkably, we observe a consistent performance improvement with the model size scaling of EVA-CLIP, despite maintaining a constant training dataset of 2 billion image-text pairs from LAION-2B and COYO-700M. This dataset is openly available and much smaller than the in-house datasets (e.g., DFN-5B, WebLI-10B) employed in other state-of-the-art CLIP models. EVA-CLIP-18B demonstrates the potential of EVA-style weak-to-strong visual model scaling. With our model weights made publicly available, we hope to facilitate future research in vision and multimodal foundation models.
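
The headline metric in the abstract, zero-shot top-1 accuracy averaged over benchmarks, can be illustrated with a minimal sketch. It assumes the usual CLIP-style procedure: each image is assigned the class whose text embedding has the highest cosine similarity with the image embedding. The function names and the toy two-dimensional embeddings below are hypothetical stand-ins, not the authors' evaluation code; a real run would use the model's image and text encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose text embedding is most similar to the image."""
    img = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(txt)))
        for txt in class_text_embs
    ]
    return max(range(len(sims)), key=sims.__getitem__)

def top1_accuracy(image_embs, labels, class_text_embs):
    """Percentage of images whose top-scoring class matches the label."""
    correct = sum(
        zero_shot_predict(emb, class_text_embs) == y
        for emb, y in zip(image_embs, labels)
    )
    return 100.0 * correct / len(labels)

def mean_accuracy(per_benchmark):
    """Unweighted mean of per-benchmark accuracies (each benchmark counts equally)."""
    return sum(per_benchmark.values()) / len(per_benchmark)

# Toy stand-in embeddings (hypothetical values, not model outputs).
class_text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]

acc = top1_accuracy(image_embs, labels, class_text_embs)
avg = mean_accuracy({"toy_a": 50.0, "toy_b": 100.0})
```

The unweighted mean is an assumption about how the per-benchmark scores are combined; weighting by dataset size would give a different figure.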

Tasks

Image Classification, Zero-Shot Transfer Image Classification

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
Food-101	EVA-CLIP-18B	Top 1 Accuracy	95.8	—	Unverified
ImageNet	EVA-CLIP-18B	Accuracy (Private)	83.8	—	Unverified
ImageNet-A	EVA-CLIP-18B	Accuracy (Private)	87.3	—	Unverified
ImageNet-R	EVA-CLIP-18B	Accuracy	95.7	—	Unverified
ImageNet-Sketch	EVA-CLIP-18B	Accuracy (Private)	74.7	—	Unverified
ImageNet V2	EVA-CLIP-18B	Accuracy (Private)	77.9	—	Unverified
ObjectNet	EVA-CLIP-18B	Accuracy (Private)	82.2	—	Unverified
SUN	EVA-CLIP-18B	Accuracy	77.7	—	Unverified
