An Analysis of D^α seeding for k-means
Etienne Bamas, Sai Ganesh Nagarajan, Ola Svensson
Abstract
One of the most popular clustering algorithms is the celebrated $D^\alpha$ seeding algorithm (also known as $k$-means++ when $\alpha=2$) by Arthur and Vassilvitskii (2007), who showed that it guarantees in expectation an $O(2^{2\alpha}\cdot \log k)$-approximate solution to the $(k,\alpha)$-means cost (where Euclidean distances are raised to the power $\alpha$) for any $\alpha\ge 1$. More recently, Balcan, Dick, and White (2018) observed experimentally that using $D^\alpha$ seeding with $\alpha>2$ can lead to a better solution with respect to the standard $k$-means objective (i.e. the $(k,2)$-means cost). In this paper, we provide a rigorous understanding of this phenomenon. For any $\alpha>2$, we show that $D^\alpha$ seeding guarantees in expectation an approximation factor of $O_\alpha\left((g_\alpha)^{2/\alpha}\cdot\left(\frac{\sigma_{\max}}{\sigma_{\min}}\right)^{2-4/\alpha}\cdot(\min\{\ell,\log k\})^{2/\alpha}\right)$ with respect to the standard $k$-means cost of any underlying clustering, where $g_\alpha$ is a parameter capturing the concentration of the points in each cluster, $\sigma_{\max}$ and $\sigma_{\min}$ are the maximum and minimum standard deviations of the clusters around their means, and $\ell$ is the number of distinct mixing weights in the underlying clustering (after rounding them to the nearest power of $2$). We complement these results with lower bounds showing that the dependency on $g_\alpha$ and $\sigma_{\max}/\sigma_{\min}$ is tight. Finally, we provide an experimental confirmation of the effects of the aforementioned parameters when using $D^\alpha$ seeding. Further, we corroborate the observation that $\alpha>2$ can indeed improve the $k$-means cost compared to $D^2$ seeding, and that this advantage remains even if we run Lloyd's algorithm after the seeding.
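To make the seeding rule concrete, the following is a minimal pure-Python sketch of $D^\alpha$ seeding as described in the abstract: the first center is chosen uniformly at random, and each subsequent center is sampled with probability proportional to the distance to the nearest already-chosen center raised to the power $\alpha$. The function names (`sq_dist`, `d_alpha_seeding`, `kmeans_cost`) and the early exit for degenerate inputs are illustrative assumptions, not the authors' implementation.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d_alpha_seeding(points, k, alpha=2.0, rng=None):
    """Pick up to k centers from `points` (hypothetical helper).

    Each new center is drawn with probability proportional to
    (distance to the nearest chosen center) ** alpha.
    alpha = 2 corresponds to k-means++ seeding.
    """
    rng = rng or random.Random()
    centers = [rng.choice(points)]  # first center: uniform at random
    # nearest[i] = squared distance from points[i] to its closest center so far
    nearest = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # (squared distance) ** (alpha / 2) equals distance ** alpha
        weights = [d ** (alpha / 2.0) for d in nearest]
        if sum(weights) == 0:  # assumption: stop if every point is already a center
            break
        new_center = rng.choices(points, weights=weights, k=1)[0]
        centers.append(new_center)
        nearest = [min(d, sq_dist(p, new_center)) for p, d in zip(points, nearest)]
    return centers


def kmeans_cost(points, centers):
    """Standard (k, 2)-means cost: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

One way to explore the phenomenon the abstract describes is to call `d_alpha_seeding` with several values of `alpha` on the same data and compare the resulting `kmeans_cost`, which always measures the standard squared-distance objective regardless of the `alpha` used for seeding.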