Select Via Proxy: Efficient Data Selection For Training Deep Networks

2019-05-01 · ICLR 2019

Cody Coleman, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, Matei Zaharia


Abstract

At internet scale, applications collect a tremendous amount of data by logging user events, analyzing text, and collecting images. This data powers a variety of machine learning models for tasks such as image classification, language modeling, content recommendation, and advertising. However, training large models over all available data can be computationally expensive, creating a bottleneck in the development of new machine learning models. In this work, we develop a novel approach to efficiently select a subset of training data to achieve faster training with no loss in model predictive performance. In our approach, we first quickly train a small proxy model, then use it to estimate the utility of individual training data points, and finally select the most informative points for training the large target model. Extensive experiments show that our approach leads to 1.6x and 1.8x speed-ups on CIFAR10 and SVHN by selecting 60% and 50% subsets of the data, respectively, while maintaining the predictive performance of the model trained on the entire dataset.
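
The three steps described in the abstract (train a proxy, score each point, keep the most informative subset) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how utility is measured, so the sketch assumes predictive entropy as the score, and it uses a tiny NumPy softmax regression as a hypothetical stand-in for the small proxy network. The function names `train_proxy`, `predict_proba`, and `select_subset` are illustrative only.

```python
import numpy as np


def predict_proba(X, W, b):
    """Softmax class probabilities for a linear model."""
    logits = X @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


def train_proxy(X, y, num_classes, epochs=50, lr=0.5):
    """Fit a small softmax-regression proxy with full-batch gradient descent.

    Assumption: a linear model stands in for the paper's small proxy network.
    """
    n, d = X.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot labels
    for _ in range(epochs):
        P = predict_proba(X, W, b)
        grad = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def select_subset(X, y, num_classes, fraction):
    """Return indices of the `fraction` of points the proxy is least sure about.

    Assumption: predictive entropy is used as the utility score.
    """
    W, b = train_proxy(X, y, num_classes)
    P = predict_proba(X, W, b)
    entropy = -(P * np.log(P + 1e-12)).sum(axis=1)
    k = int(round(fraction * len(X)))
    return np.argsort(-entropy)[:k]  # highest-entropy points first
```

In this sketch, the large target model would then be trained only on `X[idx]` and `y[idx]`, where `idx = select_subset(X, y, num_classes, fraction=0.6)` mirrors the 60% subset size reported for CIFAR10.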

Tasks

BIG-bench Machine Learning, Image Classification, Language Modeling
