Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

2022-03-10

Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, Ludwig Schmidt

Code

github.com/mlfoundations/model-soups (official, in paper; PyTorch; ★ 510)
github.com/Burf/ModelSoups (TensorFlow; ★ 50)
github.com/shallowlearn/sportsreid (PyTorch; ★ 22)
github.com/flowritecom/flow-merge (PyTorch; ★ 20)
github.com/hwk0702/keras2torch/tree/main/Computer_Vision/Model_Soup (PyTorch; ★ 0)

Abstract

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups." When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieved a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at https://github.com/mlfoundations/model-soups.
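The core operation described in the abstract, averaging the weights of several fine-tuned models into one, can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the `Weights` type is a hypothetical stand-in for a model state dict (plain lists of floats instead of tensors), and `uniform_soup`, `linear_logit`, and `ensemble_logit` are names invented for this example. It assumes every model shares the same architecture and parameter names, as is the case for models fine-tuned from the same pre-trained checkpoint.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat values.
Weights = Dict[str, List[float]]


def uniform_soup(models: List[Weights]) -> Weights:
    """Average each named parameter element-wise across all models."""
    if not models:
        raise ValueError("need at least one model")
    keys = models[0].keys()
    for m in models:
        if m.keys() != keys:
            raise ValueError("all models must share the same parameter names")
    n = len(models)
    soup: Weights = {}
    for name in keys:
        # zip(*...) groups the i-th value of this parameter from every model.
        columns = zip(*(m[name] for m in models))
        soup[name] = [sum(col) / n for col in columns]
    return soup


def linear_logit(weights: Weights, x: List[float]) -> float:
    """Toy one-output linear model: w . x + b."""
    return sum(w * xi for w, xi in zip(weights["w"], x)) + weights["b"][0]


def ensemble_logit(models: List[Weights], x: List[float]) -> float:
    """Conventional logit ensemble: one forward pass per model, then average."""
    return sum(linear_logit(m, x) for m in models) / len(models)


if __name__ == "__main__":
    # Two toy "fine-tuned" models with made-up weights.
    models = [
        {"w": [1.0, 2.0], "b": [0.0]},
        {"w": [3.0, 4.0], "b": [2.0]},
    ]
    soup = uniform_soup(models)
    x = [1.0, 1.0]
    # The soup needs one forward pass; the ensemble needs len(models) passes.
    print(soup, linear_logit(soup, x), ensemble_logit(models, x))
```

The soup is a single set of weights, so inference costs the same as for one model, whereas the ensemble runs every member. For this linear toy the soup and the ensemble produce identical logits; for deep networks they generally differ, and the abstract relates how close they are to the flatness of the loss and the confidence of the predictions.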

Tasks

Domain Generalization, Image Classification, Out-of-Distribution Generalization, Unsupervised Domain Adaptation

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
ImageNet-A	Model soups (ViT-G/14)	Top-1 accuracy %	92.67	—	Unverified
ImageNet-A	Model soups (BASIC-L)	Top-1 accuracy %	94.17	—	Unverified
ImageNet-R	Model soups (BASIC-L)	Top-1 Error Rate	3.9	—	Unverified
ImageNet-R	Model soups (ViT-G/14)	Top-1 Error Rate	4.54	—	Unverified
ImageNet-Sketch	Model soups (ViT-G/14)	Top-1 accuracy %	74.24	—	Unverified
ImageNet-Sketch	Model soups (BASIC-L)	Top-1 accuracy %	77.18	—	Unverified