Predicting the impact of dataset composition on model performance

2021-01-01

Tatsunori Hashimoto

Abstract

Real-world machine learning systems are often trained using a mix of data sources with varying cost and quality. Understanding how the size and composition of a training dataset affect model performance is critical for advancing our understanding of generalization, as well as for designing more effective data collection policies. We show that there is a simple, accurate way to predict the loss incurred by a model based on data size and composition. Our work expands on recent observations of log-linear generalization error and uses them to cast model performance prediction as a learning problem. Using the theory of optimal experimental design, we derive a simple rational function approximation to generalization error that can be fitted using a few model training runs. Our approach achieves nearly exact (r^2 > .93) predictions of model performance under substantial extrapolation in two different standard supervised learning tasks and is accurate (r^2 > .83) on more challenging machine translation and question answering tasks where baselines achieve worse-than-random performance.
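
To make the idea of treating performance prediction as a learning problem concrete, the following is a minimal sketch of the log-linear relationship the abstract refers to, fitted on a handful of training runs and then extrapolated to a larger dataset. It is not the paper's rational function approximation, whose exact form is not given here, and it models dataset size only, not composition. The function names, the synthetic loss values, and the constants `true_alpha` and `true_c` are all assumptions made for illustration.

```python
import numpy as np


def fit_log_linear(sizes, losses):
    """Least-squares fit of log(loss) = intercept - alpha * log(n)."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return -slope, intercept


def predict_loss(n, alpha, intercept):
    """Loss predicted by the fitted log-linear model at dataset size n."""
    return np.exp(intercept - alpha * np.log(n))


# Synthetic stand-in for a few small training runs (hypothetical values,
# not results from the paper).
rng = np.random.default_rng(0)
train_sizes = np.array([1_000, 2_000, 4_000, 8_000, 16_000], dtype=float)
true_alpha, true_c = 0.35, 2.0
noise = rng.normal(0.0, 0.01, size=train_sizes.shape)
observed = np.exp(true_c - true_alpha * np.log(train_sizes) + noise)

# Fit on the small runs, then extrapolate to an 8x larger dataset.
alpha_hat, c_hat = fit_log_linear(train_sizes, observed)
extrapolated = predict_loss(128_000.0, alpha_hat, c_hat)
print(f"alpha = {alpha_hat:.3f}, predicted loss at n=128000: {extrapolated:.4f}")
```

In this sketch the fit is an ordinary linear regression in log-log space; a predictor that also accounts for composition would need additional inputs, such as the number of examples drawn from each data source.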

Tasks

Experimental Design, Machine Translation, Question Answering, Translation
