Taming Nonconvexity in Kernel Feature Selection -- Favorable Properties of the Laplace Kernel
Feng Ruan, Keli Liu, Michael I. Jordan
Abstract
Kernel-based feature selection is an important tool in nonparametric statistics. Despite its many practical applications, little statistical theory is available to support the method. A core challenge is that the objective functions of the optimization problems used to define kernel-based feature selection are nonconvex. The literature has only studied the statistical properties of the global optima, which is a mismatch: the gradient-based algorithms available for nonconvex optimization can only guarantee convergence to local minima. Studying the full landscape associated with kernel-based methods, we show that feature selection objectives using the Laplace kernel (and other ℓ₁ kernels) come with statistical guarantees that other kernels, including the ubiquitous Gaussian kernel (and other ℓ₂ kernels), do not possess. Based on a sharp characterization of the gradient of the objective function, we show that ℓ₁ kernels eliminate unfavorable stationary points that appear when using an ℓ₂ kernel. Armed with this insight, we establish statistical guarantees for ℓ₁-kernel-based feature selection that do not require reaching the global minima. In particular, we establish model-selection consistency of ℓ₁-kernel-based feature selection in recovering main effects and hierarchical interactions in the nonparametric setting with n ≪ p samples.
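To make the setting concrete, the following is a minimal illustrative sketch of kernel-based feature selection with an ℓ₁ (Laplace) kernel: per-feature weights are learned so that the weighted kernel aligns with the response, and gradient-based local search drives irrelevant weights toward zero. The HSIC-style alignment objective, the ℓ₁ penalty, the finite-difference projected gradient descent, and all constants here are illustrative assumptions, not the paper's exact formulation or algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 80, 10
X = rng.normal(size=(n, p))
# Response depends only on features 0 and 1; the other 8 are irrelevant.
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n)

def laplace_kernel(X, beta):
    # l1 kernel with per-feature weights: exp(-sum_j beta_j |x_ij - x_kj|).
    D = np.abs(X[:, None, :] - X[None, :, :])        # (n, n, p) pairwise l1 distances
    return np.exp(-(D * beta).sum(axis=-1))

def centered_alignment(K, y):
    # HSIC-style alignment between the kernel matrix and the response.
    H = np.eye(len(y)) - 1.0 / len(y)
    Kc = H @ K @ H
    return np.sum(Kc * np.outer(y, y)) / len(y) ** 2

def objective(beta):
    # Negative alignment plus an l1 penalty encouraging sparse feature weights.
    return -centered_alignment(laplace_kernel(X, beta), y) + 1e-3 * beta.sum()

# Projected gradient descent with finite-difference gradients (beta >= 0):
# a crude stand-in for the gradient-based local search the paper analyzes.
beta = np.full(p, 0.1)
eps, step = 1e-5, 0.5
for _ in range(200):
    grad = np.zeros(p)
    for j in range(p):
        e = np.zeros(p)
        e[j] = eps
        grad[j] = (objective(beta + e) - objective(beta - e)) / (2 * eps)
    beta = np.maximum(beta - step * grad, 0.0)

# The relevant features (0 and 1) typically end up with the largest weights.
print(np.round(beta, 3))
```

The paper's contrast with ℓ₂ kernels can be probed in this sketch by swapping the Laplace kernel for a Gaussian one, `exp(-sum_j beta_j (x_ij - x_kj)**2)`, and observing how the optimization landscape changes.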