Coresets for Multiple _p Regression
David P. Woodruff, Taisuke Yasuda
Unverified — Be the first to reproduce this paper.
ReproduceAbstract
A coreset of a dataset with n examples and d features is a weighted subset of examples that is sufficient for solving downstream data analytic tasks. Nearly optimal constructions of coresets for least squares and _p linear regression with a single response are known in prior work. However, for multiple _p regression where there can be m responses, there are no known constructions with size sublinear in m. In this work, we construct coresets of size O(^-2d) for p<2 and O(^-pd^p/2) for p>2 independently of m (i.e., dimension-free) that approximate the multiple _p regression objective at every point in the domain up to (1) relative error. If we only need to preserve the minimizer subject to a subspace constraint, we improve these bounds by an factor for all p>1. All of our bounds are nearly tight. We give two application of our results. First, we settle the number of uniform samples needed to approximate _p Euclidean power means up to a (1+) factor, showing that (^-2) samples for p = 1, (^-1) samples for 1 < p < 2, and (^1-p) samples for p>2 is tight, answering a question of Cohen-Addad, Saulpic, and Schwiegelshohn. Second, we show that for 1<p<2, every matrix has a subset of O(^-1k) rows which spans a (1+)-approximately optimal k-dimensional subspace for _p subspace approximation, which is also nearly optimal.