Heavy hitters via cluster-preserving clustering
Kasper Green Larsen, Jelani Nelson, Huy L. Nguyen, Mikkel Thorup
Unverified — Be the first to reproduce this paper.
ReproduceAbstract
In turnstile _p -heavy hitters, one maintains a high-dimensional xR^n subject to update(i,) causing x_i x_i + , where i[n], R. Upon receiving a query, the goal is to report a small list L[n], |L| = O(1/^p), containing every "heavy hitter" i[n] with |x_i| \|x_1/^p\|_p, where x_k denotes the vector obtained by zeroing out the largest k entries of x in magnitude. For any p(0,2] the CountSketch solves _p heavy hitters using O(^-p n) words of space with O( n) update time, O(n n) query time to output L, and whose output after any query is correct with high probability (whp) 1 - 1/poly(n). Unfortunately the query time is very slow. To remedy this, the work [CM05] proposed for p=1 in the strict turnstile model, a whp correct algorithm achieving suboptimal space O(^-1^2 n), worse update time O(^2 n), but much better query time O(^-1poly( n)). We show this tradeoff between space and update time versus query time is unnecessary. We provide a new algorithm, ExpanderSketch, which in the most general turnstile model achieves optimal O(^-p n) space, O( n) update time, and fast O(^-ppoly( n)) query time, and whp correctness. Our main innovation is an efficient reduction from the heavy hitters to a clustering problem in which each heavy hitter is encoded as some form of noisy spectral cluster in a much bigger graph, and the goal is to identify every cluster. Since every heavy hitter must be found, correctness requires that every cluster be found. We then develop a "cluster-preserving clustering" algorithm, partitioning the graph into clusters without destroying any original cluster.