SOTAVerified

DartMinHash: Fast Sketching for Weighted Sets

2020-05-23Code Available1· sign in to hype

Tobias Christiani

Code Available — Be the first to reproduce this paper.

Reproduce

Code

Abstract

Weighted minwise hashing is a standard dimensionality reduction technique with applications to similarity search and large-scale kernel machines. We introduce a simple algorithm that takes a weighted set x R_ 0^d and computes k independent minhashes in expected time O(k k + x _0( x _1 + 1/ x _1)), improving upon the state-of-the-art BagMinHash algorithm (KDD '18) and representing the fastest weighted minhash algorithm for sparse data. Our experiments show running times that scale better with k and x _0 compared to ICWS (ICDM '10) and BagMinhash, obtaining 10x speedups in common use cases. Our approach also gives rise to a technique for computing fully independent locality-sensitive hash values for (L, K)-parameterized approximate near neighbor search under weighted Jaccard similarity in optimal expected time O(LK + x _0), improving on prior work even in the case of unweighted sets.

Tasks

Reproductions