SOTAVerified

Multimodal Autoregressive Pre-training of Large Vision Encoders

2024-11-21 · CVPR 2025 · Code Available

Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby


Abstract

We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
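The core idea in the abstract is a joint autoregressive objective: a vision encoder feeds a multimodal decoder that predicts raw image patches (a regression target) and text tokens (a classification target). The sketch below is a minimal, hypothetical illustration of combining the two losses; the linear maps, dimensions, and prefix-pooling stand in for the actual transformer encoder/decoder and are not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, for illustration only)
num_patches, patch_dim = 4, 8   # image split into 4 flattened patches
num_text, vocab = 3, 16         # 3 text tokens over a vocab of 16
d_model = 8

# Inputs: raw image patches and text token ids
patches = rng.normal(size=(num_patches, patch_dim))
text_ids = rng.integers(0, vocab, size=num_text)

# "Encoder": one linear map standing in for the vision transformer
W_enc = rng.normal(size=(patch_dim, d_model)) * 0.1
vision_feats = patches @ W_enc  # (num_patches, d_model)

# "Decoder" heads: regress the next patch, classify the next text token
W_patch = rng.normal(size=(d_model, patch_dim)) * 0.1
W_text = rng.normal(size=(d_model, vocab)) * 0.1

# Patch loss: predict patch t+1 from a mean-pooled causal prefix
patch_loss = 0.0
for t in range(num_patches - 1):
    prefix = vision_feats[: t + 1].mean(axis=0)
    pred = prefix @ W_patch
    patch_loss += np.mean((pred - patches[t + 1]) ** 2)
patch_loss /= num_patches - 1

# Text loss: cross-entropy for each text token given the image context
text_loss = 0.0
ctx = vision_feats.mean(axis=0)
for t in range(num_text):
    logits = ctx @ W_text
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    text_loss += -log_probs[text_ids[t]]
text_loss /= num_text

total_loss = patch_loss + text_loss
print(f"patch (MSE) loss: {patch_loss:.3f}, text (CE) loss: {text_loss:.3f}")
```

The single summed objective over both modalities is what lets one pre-training run produce an encoder useful for both vision-only and multimodal downstream tasks.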

Benchmark Results

| Dataset     | Model               | Metric         | Claimed | Verified | Status     |
|-------------|---------------------|----------------|---------|----------|------------|
| ImageNet    | AIMv2-L             | Top-1 Accuracy | 86.6    |          | Unverified |
| ImageNet    | AIMv2-3B (448 res)  | Top-1 Accuracy | 89.5    |          | Unverified |
| ImageNet    | AIMv2-3B            | Top-1 Accuracy | 88.5    |          | Unverified |
| ImageNet    | AIMv2-1B            | Top-1 Accuracy | 88.1    |          | Unverified |
| ImageNet    | AIMv2-H             | Top-1 Accuracy | 87.5    |          | Unverified |
| iNaturalist | AIMv2-3B (448 res)  | Top-1 Accuracy | 85.9    |          | Unverified |
| iNaturalist | AIMv2-3B            | Top-1 Accuracy | 81.5    |          | Unverified |
| iNaturalist | AIMv2-1B            | Top-1 Accuracy | 79.7    |          | Unverified |
| iNaturalist | AIMv2-H             | Top-1 Accuracy | 77.9    |          | Unverified |
| iNaturalist | AIMv2-L             | Top-1 Accuracy | 76.0    |          | Unverified |

Reproductions