Ensemble of Averages: Improving Model Selection and Boosting Performance in Domain Generalization

2021-10-21

Devansh Arpit, Huan Wang, Yingbo Zhou, Caiming Xiong

Code

github.com/salesforce/ensemble-of-averages
Official · In paper · PyTorch · ★ 31

Abstract

In Domain Generalization (DG) settings, models trained independently on a given set of training domains have notoriously chaotic performance on distribution-shifted test domains, and stochasticity in optimization (e.g. seed) plays a big role. This makes deep learning models unreliable in real-world settings. We first show that this chaotic behavior exists even along the training optimization trajectory of a single model, and propose a simple model averaging protocol that both significantly boosts domain generalization and diminishes the impact of stochasticity by improving the rank correlation between the in-domain validation accuracy and out-domain test accuracy, which is crucial for reliable early stopping. Taking advantage of our observation, we show that instead of ensembling unaveraged models (as is typical in practice), ensembling moving average models (EoA) from independent runs further boosts performance. We theoretically explain the boost in performance of ensembling and model averaging by adapting the well-known Bias-Variance trade-off to the domain generalization setting. On the DomainBed benchmark, when using a pre-trained ResNet-50, this ensemble of averages achieves an average of 68.0%, beating vanilla ERM (w/o averaging/ensembling) by 4%, and when using a pre-trained RegNetY-16GF, achieves an average of 76.6%, beating vanilla ERM by 6%. Our code is available at https://github.com/salesforce/ensemble-of-averages.
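
The two ideas in the abstract, averaging parameters along one training run and then ensembling the averaged models from independent runs, can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation (the official code is in PyTorch at the repository above). It assumes a uniform running average over parameter snapshots and an ensemble that averages predicted class probabilities; the abstract does not spell out either detail. The toy linear classifier, the function names, and all numbers are hypothetical.

```python
import math


def update_running_average(avg_params, new_params, num_averaged):
    """Fold one new snapshot into a uniform running average.

    Assumption: `avg_params` is already the mean of `num_averaged` snapshots.
    """
    return [(a * num_averaged + p) / (num_averaged + 1)
            for a, p in zip(avg_params, new_params)]


def average_trajectory(snapshots):
    """Average all parameter snapshots collected along a single run."""
    avg = list(snapshots[0])
    for n, params in enumerate(snapshots[1:], start=1):
        avg = update_running_average(avg, params, n)
    return avg


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(params, x, num_classes):
    """Toy linear classifier (hypothetical): one weight row per class."""
    dim = len(x)
    logits = [sum(params[c * dim + i] * x[i] for i in range(dim))
              for c in range(num_classes)]
    return softmax(logits)


def ensemble_predict(averaged_models, x, num_classes):
    """Average class probabilities across the averaged models."""
    probs = [predict_proba(p, x, num_classes) for p in averaged_models]
    return [sum(col) / len(probs) for col in zip(*probs)]


# Hypothetical parameter snapshots from two independent runs (e.g. two seeds).
run_a = [[0.9, 0.1, 0.2, 0.8], [1.1, -0.1, 0.0, 1.0], [1.0, 0.0, 0.1, 0.9]]
run_b = [[0.7, 0.3, 0.4, 0.6], [0.9, 0.1, 0.2, 0.8]]

# One averaged model per run, then an ensemble over those averaged models.
averaged_models = [average_trajectory(run_a), average_trajectory(run_b)]
eoa_probs = ensemble_predict(averaged_models, x=[1.0, 0.0], num_classes=2)
print(eoa_probs)
```

The ordering of the two steps is the point of the sketch: parameters are averaged within a run, while predictions are averaged across runs, since independently trained networks generally cannot be averaged meaningfully in parameter space.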

Tasks

Domain Generalization, Model Selection

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
DomainNet	Ensemble of Averages (RegNetY-16GF)	Average Accuracy	60.9	—	Unverified
DomainNet	Ensemble of Averages (ResNeXt-50 32x4d)	Average Accuracy	54.6	—	Unverified
DomainNet	Ensemble of Averages (ResNet-50)	Average Accuracy	47.4	—	Unverified
Office-Home	Ensemble of Averages (RegNetY-16GF)	Average Accuracy	83.9	—	Unverified
Office-Home	Ensemble of Averages (ResNeXt-50 32x4d)	Average Accuracy	80.2	—	Unverified
Office-Home	Ensemble of Averages (ResNet-50)	Average Accuracy	72.5	—	Unverified
PACS	Ensemble of Averages (RegNetY-16GF)	Average Accuracy	95.8	—	Unverified
PACS	Ensemble of Averages (ResNeXt-50 32x4d)	Average Accuracy	93.2	—	Unverified
PACS	Ensemble of Averages (ResNet-50)	Average Accuracy	88.6	—	Unverified
TerraIncognita	Ensemble of Averages (RegNetY-16GF)	Average Accuracy	61.1	—	Unverified
TerraIncognita	Ensemble of Averages (ResNeXt-50 32x4d)	Average Accuracy	55.2	—	Unverified
TerraIncognita	Ensemble of Averages (ResNet-50)	Average Accuracy	52.3	—	Unverified
VLCS	Ensemble of Averages (RegNetY-16GF)	Average Accuracy	81.1	—	Unverified
VLCS	Ensemble of Averages (ResNeXt-50 32x4d)	Average Accuracy	80.4	—	Unverified
VLCS	Ensemble of Averages (ResNet-50)	Average Accuracy	79.1	—	Unverified
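
As a quick sanity check, the per-backbone averages quoted in the abstract can be recomputed from the rows above. The sketch below assumes the reported average is the unweighted mean over the five datasets listed in this table; the dictionary layout and variable names are illustrative only.

```python
from statistics import mean

# Claimed average accuracies copied from the table above (unverified).
claimed = {
    "RegNetY-16GF": {"DomainNet": 60.9, "Office-Home": 83.9, "PACS": 95.8,
                     "TerraIncognita": 61.1, "VLCS": 81.1},
    "ResNeXt-50 32x4d": {"DomainNet": 54.6, "Office-Home": 80.2, "PACS": 93.2,
                         "TerraIncognita": 55.2, "VLCS": 80.4},
    "ResNet-50": {"DomainNet": 47.4, "Office-Home": 72.5, "PACS": 88.6,
                  "TerraIncognita": 52.3, "VLCS": 79.1},
}

# Unweighted mean across datasets, rounded to one decimal place.
averages = {backbone: round(mean(scores.values()), 1)
            for backbone, scores in claimed.items()}

for backbone, avg in averages.items():
    print(f"{backbone}: {avg}")
```

Under that assumption, the ResNet-50 and RegNetY-16GF rows reproduce the 68.0% and 76.6% figures stated in the abstract.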