Pizza lunch will be served preceding the talk, starting at noon.
I will present two recent works on the feature-learning ability of neural networks trained with (stochastic) gradient-descent-like algorithms. In the first, we will consider the task of training a two-layer neural network with SGD to learn a multi-index model, i.e., a function whose value depends only on a k-dimensional projection of the input data, where k is much smaller than the ambient data dimension d. When the input data is isotropic Gaussian, we will show that training the first layer of the network with SGD and weight decay results in convergence to the subspace spanned by the relevant k directions, using a number of samples linear in d (hence efficient feature learning). As an application of this convergence, we will show that monotone single-index (k = 1) functions can be learned by SGD-trained neural networks with sample complexity that is almost linear in d.
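The setup above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's actual experiment: the dimensions, learning rate, weight-decay strength, and the choice of target f(x) = relu(⟨x, w*⟩) are all illustrative assumptions. It trains only the first layer with SGD plus weight decay on isotropic Gaussian data and reports how much of the first layer's mass lies along the single relevant direction.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, n = 20, 32, 2000                    # ambient dim, hidden width, sample count
w_star = np.zeros(d); w_star[0] = 1.0     # the relevant direction (k = 1)

X = rng.standard_normal((n, d))           # isotropic Gaussian inputs
y = np.maximum(X @ w_star, 0.0)           # monotone single-index target: relu(<x, w*>)

W = rng.standard_normal((m, d)) / np.sqrt(d)       # trainable first layer
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # fixed second layer

lr, wd, batch = 0.05, 1e-4, 64            # illustrative hyperparameters

def forward(Xb):
    H = np.maximum(Xb @ W.T, 0.0)         # hidden relu activations, shape (batch, m)
    return H, H @ a                       # activations and network predictions

def mse(Xb, yb):
    _, pred = forward(Xb)
    return np.mean((pred - yb) ** 2)

loss0 = mse(X, y)
for step in range(2000):
    idx = rng.integers(0, n, size=batch)
    Xb, yb = X[idx], y[idx]
    H, pred = forward(Xb)
    err = pred - yb                       # residuals, shape (batch,)
    mask = (H > 0).astype(float)          # relu derivative
    # gradient of the MSE with respect to the first layer
    grad_W = ((err[:, None] * a[None, :] * mask).T @ Xb) * (2.0 / batch)
    W -= lr * (grad_W + wd * W)           # SGD step with weight decay

loss1 = mse(X, y)
# fraction of the first layer's norm lying along w* (1.0 = perfect alignment)
align = np.linalg.norm(W @ w_star) / np.linalg.norm(W)
print(loss0, loss1, align)
```

The alignment ratio is one simple proxy for convergence of the first layer to the relevant subspace; in the talk's setting this recovery is what enables the almost-linear-in-d sample complexity.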
Next, we will investigate the effect of additional structure in the data covariance on learning a general single-index model. Recent work has shown that in the isotropic case, the sample complexity of gradient-based algorithms for training neural networks is governed by a quantity associated with the target function called the information exponent. However, we will show that for a spiked covariance model, the sample complexity undergoes a three-stage phase transition depending on the magnitude of the spike, and can be significantly improved over the isotropic case for large spike magnitudes; in the extreme case, the sample complexity can even become independent of the information exponent.
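For concreteness, a spiked covariance model can be sampled as follows. This is a hedged sketch of one standard formulation, Σ = I + λ u uᵀ with a rank-one spike along a direction u; the spike direction and magnitudes here are illustrative choices, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 5000
u = np.zeros(d); u[0] = 1.0               # spike direction (illustrative choice)

def sample_spiked(lmbda, n):
    """Draw n samples x ~ N(0, I + lmbda * u u^T): isotropic noise plus a rank-one spike."""
    z = rng.standard_normal((n, d))       # isotropic component
    g = rng.standard_normal(n)            # scalar Gaussian along the spike
    return z + np.sqrt(lmbda) * g[:, None] * u[None, :]

# variance along the spike is 1 + lambda; orthogonal directions stay at 1
for lam in (0.0, 4.0):
    X = sample_spiked(lam, n)
    print(lam, round(np.var(X @ u), 2), round(np.var(X[:, 1]), 2))
```

The spike magnitude λ is the knob behind the three-stage phase transition described above: at λ = 0 the model reduces to the isotropic case, while large λ concentrates the data's variance along u.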