Clustering Mixtures of Bounded Covariance Distributions Under Optimal Separation
Ilias Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Thanasis Pittas
Unverified — Be the first to reproduce this paper.
ReproduceAbstract
We study the clustering problem for mixtures of bounded covariance distributions, under a fine-grained separation assumption. Specifically, given samples from a k-component mixture distribution D = _i =1^k w_i P_i, where each w_i for some known parameter , and each P_i has unknown covariance _i ^2_i I_d for some unknown _i, the goal is to cluster the samples assuming a pairwise mean separation in the order of (_i+_j)/ between every pair of components P_i and P_j. Our contributions are as follows: For the special case of nearly uniform mixtures, we give the first poly-time algorithm for this clustering task. Prior work either required separation scaling with the maximum cluster standard deviation (i.e. _i _i) [DKK+22b] or required both additional structural assumptions and mean separation scaling as a large degree polynomial in 1/ [BKK22]. For general-weight mixtures, we point out that accurate clustering is information-theoretically impossible under our fine-grained mean separation assumptions. We introduce the notion of a clustering refinement -- a list of not-too-small subsets satisfying a similar separation, and which can be merged into a clustering approximating the ground truth -- and show that it is possible to efficiently compute an accurate clustering refinement of the samples. Furthermore, under a variant of the "no large sub-cluster'' condition from in prior work [BKK22], we show that our algorithm outputs an accurate clustering, not just a refinement, even for general-weight mixtures. As a corollary, we obtain efficient clustering algorithms for mixtures of well-conditioned high-dimensional log-concave distributions. Moreover, our algorithm is robust to ()-fraction of adversarial outliers.