SOTAVerified

Testing with Non-identically Distributed Samples

2023-11-19Unverified0· sign in to hype

Shivam Garg, Chirag Pabbaraju, Kirankumar Shiragur, Gregory Valiant

Unverified — Be the first to reproduce this paper.

Reproduce

Abstract

We examine the extent to which sublinear-sample property testing and estimation applies to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size k, p_1, p_2,,p_T, and we obtain c independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, p_avg. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities -- either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with c=1 samples from each distribution, (k/^2) samples are necessary and sufficient to learn p_avg to within error in TV distance. To test uniformity or identity -- distinguishing the case that p_avg is equal to some reference distribution, versus has _1 distance at least from the reference distribution, we show that a linear number of samples in k is necessary given c=1 samples from each distribution. In contrast, for c 2, we recover the usual sublinear sample testing of the i.i.d. setting: we show that O(k/^2 + 1/^4) samples are sufficient, matching the optimal sample complexity in the i.i.d. case in the regime where k^-1/4. Additionally, we show that in the c=2 case, there is a constant > 0 such that even in the linear regime with k samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same p_i) can perform uniformity testing.

Tasks

Reproductions