SOTAVerified

An Information-Theoretic Approach to Detecting Changes in Multi-Dimensional Data Streams

2006-01-01INFORMS 2006Unverified0· sign in to hype

Tamraparni Dasu, Shankar Krishnan, Suresh Venkatasubramanian, Ke Yi

Unverified — Be the first to reproduce this paper.

Reproduce

Abstract

An important problem in processing large data streams is detecting changes in the underlying distribution that generates the data. The challenge in designing change detection schemes is making them general, scalable, and statistically sound. In this paper, we take a general, information-theoretic approach to the change detection problem, which works for multidimensional as well as categorical data. We use relative entropy, also called the Kullback-Leibler distance, to measure the difference between two given distributions. The KL-distance is known to be related to the optimal error in determining whether the two distributions are the same and draws on fundamental results in hypothesis testing. The KL-distance also generalizes traditional distance measures in statistics, and has invariance properties that make it ideally suited for comparing distributions. Our scheme is general; it is nonparametric and requires no assumptions on the underlying distributions. It employs a statistical inference procedure based on the theory of bootstrapping, which allows us to determine whether our measurements are statistically significant. The scheme is also quite flexible from a practical perspective; it can be implemented using any spatial partitioning scheme that scales well with dimensionality. In addition to providing change detections, our method generalizes Kulldorff’s spatial scan statistic, allowing us to quantitatively identify specific regions in space where large changes have occurred. We provide a detailed experimental study that demonstrates the generality and efficiency of our approach with different kinds of multidimensional datasets, both synthetic and real.

Tasks

Reproductions