Learning Mixtures of Arbitrary Distributions over Large Discrete Domains

2012-12-07

Yuval Rabani, Leonard Schulman, Chaitanya Swamy

Abstract

We give an algorithm for learning a mixture of unstructured distributions. This problem arises in various unsupervised learning scenarios, for example in learning topic models from a corpus of documents spanning several topics. We show how to learn the constituents of a mixture of k arbitrary distributions over a large discrete domain [n] = {1, 2, ..., n} and the mixture weights, using O(n polylog n) samples. (In the topic-model learning setting, the mixture constituents correspond to the topic distributions.) This task is information-theoretically impossible for k > 1 under the usual sampling process from a mixture distribution. However, there are situations (such as the above-mentioned topic model case) in which each sample point consists of several observations from the same mixture constituent. This number of observations, which we call the "sampling aperture", is a crucial parameter of the problem. We obtain the first bounds for this mixture-learning problem without imposing any assumptions on the mixture constituents. We show that efficient learning is possible exactly at the information-theoretically least-possible aperture of 2k-1. Thus, we achieve near-optimal dependence on n and optimal aperture. While the sample size required by our algorithm depends exponentially on k, we prove that such a dependence is unavoidable when one considers general mixtures. A sequence of tools contributes to the algorithm, such as concentration results for random matrices, dimension reduction, moment estimations, and sensitivity analysis.
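The role of the sampling aperture can be illustrated with a small sketch. The code below is not the paper's algorithm; it only simulates the sampling process described in the abstract and compares two hand-picked mixtures. The function names (`sample_point`, `marginal`, `agreement_probability`) and the two toy mixtures over a two-element domain are assumptions made for this illustration.

```python
import random

def sample_point(weights, constituents, aperture, rng):
    """Draw one sample point: choose a constituent according to the mixture
    weights, then draw `aperture` i.i.d. observations from that constituent."""
    j = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    p = constituents[j]
    return tuple(rng.choices(range(len(p)), weights=p, k=aperture))

def marginal(weights, constituents):
    """Exact distribution of a single observation (aperture 1)."""
    n = len(constituents[0])
    return [sum(w * p[i] for w, p in zip(weights, constituents)) for i in range(n)]

def agreement_probability(weights, constituents):
    """Exact probability that two observations in one sample point coincide
    (a statistic that is only visible at aperture 2 or more)."""
    return sum(w * sum(q * q for q in p) for w, p in zip(weights, constituents))

# Two hypothetical mixtures with k = 2 over the domain {0, 1}.
weights = [0.5, 0.5]
mixture_a = [[1.0, 0.0], [0.0, 1.0]]  # each constituent is a point mass
mixture_b = [[0.5, 0.5], [0.5, 0.5]]  # both constituents are uniform

# At aperture 1 the two mixtures induce the same observation distribution.
print(marginal(weights, mixture_a), marginal(weights, mixture_b))

# At aperture 2 they differ: repeated observations always agree under
# mixture_a, but agree only half the time under mixture_b.
print(agreement_probability(weights, mixture_a),
      agreement_probability(weights, mixture_b))

# One simulated sample point with aperture 3 (= 2k - 1 for k = 2).
rng = random.Random(0)
print(sample_point(weights, mixture_a, 3, rng))
```

The two mixtures are indistinguishable from single observations, which is the sense in which learning is impossible under the usual sampling process. That aperture 2 separates this particular pair does not mean aperture 2 suffices in general; the abstract states that 2k-1 is the least possible aperture for arbitrary constituents.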

Tasks

2k, Dimensionality Reduction, Topic Models
