
A Balanced Data Approach for Evaluating Cross-Lingual Transfer: Mapping the Linguistic Blood Bank

2022-01-16 · ACL ARR January 2022

Anonymous


Abstract

We show that the choice of pretraining languages affects downstream cross-lingual transfer for BERT-based models. We inspect zero-shot performance under balanced data conditions to mitigate data-size confounds, classifying pretraining languages that increase downstream performance as donors, and languages whose zero-shot performance improves most as recipients. We develop a method of quadratic time complexity in the number of pretraining languages to estimate these inter-language relations, instead of an exponential exhaustive computation over all possible combinations. We find that our method is effective on a diverse set of languages spanning different linguistic features and two downstream tasks. Our findings can inform developers of future large-scale multilingual language models in choosing better pretraining configurations.
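The abstract's complexity claim can be made concrete with a small sketch. The paper's actual estimation procedure is not described here; the snippet below only illustrates the cost gap between exhaustively pretraining on every subset of languages (exponential in the pool size) and scoring ordered (donor, recipient) pairs (quadratic). The language pool and the `zero_shot_score` stand-in are hypothetical placeholders, not the authors' setup.

```python
from itertools import combinations

# Hypothetical pretraining pool; the paper's actual language set differs.
languages = ["en", "ar", "tr", "fi", "he"]

def zero_shot_score(pretrain_langs, target_lang):
    """Stand-in for an expensive pretrain-then-evaluate run.
    Returns a dummy value so the sketch executes."""
    return len(pretrain_langs)  # placeholder, not a real metric

# Exhaustive approach: one pretraining run per non-empty subset -> O(2^n).
exhaustive_runs = sum(
    1 for r in range(1, len(languages) + 1)
    for _ in combinations(languages, r)
)

# Pairwise estimation: one run per ordered (donor, recipient) pair -> O(n^2).
pair_scores = {
    (donor, recipient): zero_shot_score([donor], recipient)
    for donor in languages
    for recipient in languages
    if donor != recipient
}

print(exhaustive_runs)   # 2^5 - 1 = 31 subsets
print(len(pair_scores))  # 5 * 4 = 20 ordered pairs
```

Even at five languages the exhaustive search needs 31 pretraining runs versus 20 pairwise runs, and the gap widens exponentially as the pool grows.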
