SOTAVerified

Sets Clustering

2020-03-09ICML 2020Code Available0· sign in to hype

Ibrahim Jubran, Murad Tukan, Alaa Maalouf, Dan Feldman

Code Available — Be the first to reproduce this paper.

Reproduce

Code

Abstract

The input to the sets-k-means problem is an integer k 1 and a set P= _1,,P_n\ of sets in R^d. The goal is to compute a set C of k centers (points) in R^d that minimizes the sum _P P _p P, c C\| p-c \|^2 of squared distances to these sets. An -core-set for this problem is a weighted subset of P that approximates this sum up to 1 factor, for every set C of k centers in R^d. We prove that such a core-set of O(^2n) sets always exists, and can be computed in O(nn) time, for every input P and every fixed d,k 1 and (0,1). The result easily generalized for any metric space, distances to the power of z>0, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS (1+ approximation) for the sets-k-means problem that takes time near linear in n. This is the first result even for sets-mean on the plane (k=1, d=2). Open source code and experimental results for document classification and facility locations are also provided.

Tasks

Reproductions