Neural Network Loss Landscapes

The goal of this project is to develop a mathematical framework to characterize the loss landscapes of neural networks, along with those of a variety of other formulations that arise in non-convex optimization. In particular, a wide variety of non-convex optimization problems can be captured by the general form:

$$ \min_{W} \ell(Y,\Phi(W,X)) + \lambda \Theta(W) $$

where $(X,Y)$ is a given set of training data, $W$ is the set of model parameters one is trying to optimize over (e.g., network weights), $\Phi$ is the model's prediction given the parameters $W$ and the input data $X$, $\ell$ is a loss function, and $\Theta$ is a regularization function on the model parameters (and optionally also the data $X$).
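As a concrete (and purely illustrative) instance of this general form, the NumPy sketch below plugs in a two-layer, bias-free ReLU network for $\Phi$, a squared loss for $\ell$, and a squared Frobenius-norm regularizer for $\Theta$; these particular choices are assumptions made for the example, not taken from the referenced papers.

```python
import numpy as np

def phi(W, X):
    """Illustrative model Phi(W, X): a two-layer ReLU network without biases."""
    W1, W2 = W
    return np.maximum(X @ W1, 0.0) @ W2  # hidden ReLU layer followed by a linear output layer

def objective(W, X, Y, lam=0.1):
    """General form ell(Y, Phi(W, X)) + lambda * Theta(W), with a squared loss
    and a squared Frobenius-norm regularizer as the illustrative choices."""
    loss = 0.5 * np.sum((Y - phi(W, X)) ** 2)   # ell: convex in the prediction
    theta = sum(np.sum(Wl ** 2) for Wl in W)    # Theta: convex in W
    return loss + lam * theta

# Toy data and random weights, just to evaluate the objective once.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
Y = rng.normal(size=(32, 1))
W = [rng.normal(size=(5, 8)), rng.normal(size=(8, 1))]
print(objective(W, X, Y))
```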

Many commonly used loss functions in machine learning are convex w.r.t. the model prediction (i.e., the output of $\Phi$) - for example, mean squared error, logistic loss (a.k.a. softmax + cross-entropy), etc. - and likewise many common regularization functions $\Theta$ are convex w.r.t. $W$. However, the overall problem is typically non-convex due to the convexity-destroying mapping of the model, $\Phi$.
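To see this non-convexity concretely, the sketch below (again with an illustrative two-layer, bias-free ReLU network, an assumption for the example) constructs two weight settings that produce identical predictions - one obtained from the other by permuting hidden units - so both are global minimizers of the squared loss; a function that were convex in $W$ could never rise above the chord between two minimizers, yet the loss at their midpoint is generically strictly positive.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))

def phi(W, X):
    """Illustrative two-layer, bias-free ReLU network."""
    W1, W2 = W
    return np.maximum(X @ W1, 0.0) @ W2

# Build targets that a particular weight setting Wa fits exactly, so f(Wa) = 0.
Wa = [rng.normal(size=(4, 6)), rng.normal(size=(6, 1))]
Y = phi(Wa, X)

def f(W):
    """Squared loss: convex in the prediction Phi(W, X), but not in the weights W."""
    return 0.5 * np.mean((Y - phi(W, X)) ** 2)

# Permuting the hidden units leaves the prediction unchanged, so f(Wb) = 0 as well
# (up to floating-point reordering of the sums).
perm = rng.permutation(6)
Wb = [Wa[0][:, perm], Wa[1][perm, :]]

# For a convex function, the value at the midpoint of two global minimizers could not
# exceed 0; here it is generically strictly positive, certifying non-convexity in W.
Wmid = [(a + b) / 2 for a, b in zip(Wa, Wb)]
print(f(Wa), f(Wb), f(Wmid))
```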

Current Results

One of the goals of this project is to develop general conditions on the loss function, regularization function, and model mapping (e.g., the network architecture) under which the loss surface can be characterized. In particular, we show that if $\Phi$ and $\Theta$ are positively homogeneous functions, that is, if for some $p>0$ they satisfy

$$\Phi(\alpha W,X) = \alpha^p \Phi(W,X) \ \ \text{and} \ \ \Theta(\alpha W) = \alpha^p \Theta(W) \ \forall \alpha > 0,$$

then, provided the network is sufficiently ‘large’, there always exists a non-increasing path to a global minimizer from any initialization, and a condition for global optimality can be established from local information alone. It turns out that most modern network architectures are built from positively homogeneous operators, both linear operators (such as convolutions and fully-connected layers) and non-linearities (e.g., ReLUs and max-pooling), allowing these results to be applied to a wide variety of commonly used network blocks.
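The homogeneity property is straightforward to verify numerically. The sketch below checks it for a three-layer, bias-free ReLU network (so $p=3$) and, as one degree-matched illustrative choice of $\Theta$, the product of the layers' Frobenius norms; both are assumptions made for the example rather than the specific architectures or regularizers analyzed in the referenced papers.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 5))
# Three bias-free layers, so phi below is positively homogeneous of degree p = 3 in W.
W = [rng.normal(size=(5, 8)), rng.normal(size=(8, 8)), rng.normal(size=(8, 2))]

def phi(W, X):
    """Bias-free ReLU network: alternating linear maps and ReLUs, linear final layer."""
    Z = X
    for Wl in W[:-1]:
        Z = np.maximum(Z @ Wl, 0.0)
    return Z @ W[-1]

def theta(W):
    """Illustrative degree-3 positively homogeneous regularizer: product of layer norms."""
    return np.prod([np.linalg.norm(Wl) for Wl in W])

alpha, p = 2.5, len(W)
W_scaled = [alpha * Wl for Wl in W]
print(np.allclose(phi(W_scaled, X), alpha**p * phi(W, X)))  # True
print(np.isclose(theta(W_scaled), alpha**p * theta(W)))     # True
```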

References

Haeffele and Vidal. “Global Optimality in Neural Network Training.” CVPR, 2017. (Oral presentation, top 2.5% of submissions.)

Haeffele and Vidal. “Global Optimality in Tensor Factorization, Deep Learning, and Beyond.” arXiv preprint, 2015.
