
Guarantees for Nonlinear Representation Learning: Non-identical Covariates, Dependent Data, Fewer Samples

2024-10-15

Thomas T. Zhang, Bruce D. Lee, Ingvar Ziemann, George J. Pappas, Nikolai Matni


Abstract

A driving force behind the diverse applicability of modern machine learning is the ability to extract meaningful features across many sources. However, many practical domains involve data that are non-identically distributed across sources and statistically dependent within each source, violating vital assumptions in existing theoretical studies. Toward addressing these issues, we establish statistical guarantees for learning general nonlinear representations from multiple data sources that admit different input distributions and possibly dependent data. Specifically, we study the sample complexity of learning $T+1$ functions $f_\star^{(t)} \circ g_\star$ from a function class $\mathcal{F} \circ \mathcal{G}$, where $f_\star^{(t)}$ are task-specific linear functions and $g_\star$ is a shared nonlinear representation. A representation $\hat{g}$ is estimated using $N$ samples from each of $T$ source tasks, and a fine-tuning function $\hat{f}^{(0)}$ is fit using $N'$ samples from a target task passed through $\hat{g}$. We show that when $N \gtrsim C_{\mathrm{dep}} \left( \dim(\mathcal{F}) + \mathrm{C}(\mathcal{G})/T \right)$, the excess risk of $\hat{f}^{(0)} \circ \hat{g}$ on the target task decays as $\nu_{\mathrm{div}} \left( \frac{\dim(\mathcal{F})}{N'} + \frac{\mathrm{C}(\mathcal{G})}{NT} \right)$, where $C_{\mathrm{dep}}$ denotes the effect of data dependency, $\nu_{\mathrm{div}}$ denotes an (estimatable) measure of task diversity between the source and target tasks, and $\mathrm{C}(\mathcal{G})$ denotes the complexity of the representation class $\mathcal{G}$. In particular, our analysis reveals that as the number of tasks $T$ increases, both the sample requirement and the risk bound converge to those of $r$-dimensional regression as if $g_\star$ had been given, and that the effect of dependency enters only the sample requirement, leaving the risk bound matching the iid setting.
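For intuition, here is a minimal PyTorch sketch of the two-stage procedure the abstract describes: a shared representation $\hat{g}$ is fit by empirical risk minimization jointly with task-specific linear heads across the $T$ source tasks, and the target head $\hat{f}^{(0)}$ is then fit by least squares on $N'$ target samples passed through $\hat{g}$. This is not the authors' code; the MLP architecture for $\mathcal{G}$, the optimizer, the sample sizes, and the synthetic data are all illustrative assumptions.

```python
import torch
import torch.nn as nn

T, N, N_prime = 10, 200, 50   # source tasks, samples per source task, target samples
d, r = 20, 5                  # input dimension, representation dimension

# Shared nonlinear representation g: R^d -> R^r (a small MLP, one example class G).
g = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, r))
heads = nn.ModuleList([nn.Linear(r, 1) for _ in range(T)])  # task-specific heads f^(t)

# Synthetic stand-in data; scaling per task mimics non-identical input distributions.
Xs = [torch.randn(N, d) * (1.0 + 0.1 * t) for t in range(T)]
Ys = [torch.randn(N, 1) for _ in range(T)]

# Stage 1: multi-task ERM over the composed class (average squared loss over sources).
opt = torch.optim.Adam(list(g.parameters()) + list(heads.parameters()), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    loss = sum(((heads[t](g(Xs[t])) - Ys[t]) ** 2).mean() for t in range(T)) / T
    loss.backward()
    opt.step()

# Stage 2: freeze g-hat and fit the target head f^(0) on N' target samples.
X0, Y0 = torch.randn(N_prime, d), torch.randn(N_prime, 1)
with torch.no_grad():
    Z0 = g(X0)                               # target inputs passed through g-hat
f0 = torch.linalg.lstsq(Z0, Y0).solution     # r x 1 least-squares fine-tuning head
```

Note that Stage 2 is just an $r$-dimensional linear regression, which is the sense in which, as $T$ grows, the guarantee approaches that of $r$-dimensional regression with $g_\star$ given.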
