Generalization Bounds for Neural Networks via Approximate Description Length
Amit Daniely, Elad Granot
Unverified — Be the first to reproduce this paper.
ReproduceAbstract
We investigate the sample complexity of networks with bounds on the magnitude of its weights. In particular, we consider the class \[ H= \W_t W_1 :W_1, ,W_t-1 M_d, d, W_t M_1,d \ \] where the spectral norm of each W_i is bounded by O(1), the Frobenius norm is bounded by R, and is the sigmoid function e^x1+e^x or the smoothened ReLU function (1+e^x). We show that for any depth t, if the inputs are in [-1,1]^d, the sample complexity of H is O(dR^2^2). This bound is optimal up to log-factors, and substantially improves over the previous state of the art of O(d^2R^2^2). We furthermore show that this bound remains valid if instead of considering the magnitude of the W_i's, we consider the magnitude of W_i - W_i^0, where W_i^0 are some reference matrices, with spectral norm of O(1). By taking the W_i^0 to be the matrices at the onset of the training process, we get sample complexity bounds that are sub-linear in the number of parameters, in many typical regimes of parameters. To establish our results we develop a new technique to analyze the sample complexity of families H of predictors. We start by defining a new notion of a randomized approximate description of functions f:XR^d. We then show that if there is a way to approximately describe functions in a class H using d bits, then d/^2 examples suffices to guarantee uniform convergence. Namely, that the empirical loss of all the functions in the class is -close to the true loss. Finally, we develop a set of tools for calculating the approximate description length of classes of functions that can be presented as a composition of linear function classes and non-linear functions.