
Publication

Function Norms for Neural Networks: Theory and Applications

Book - Dissertation

Deep Neural Networks (DNNs) have proven to be among the most powerful tools for large-scale machine learning. Recent years have witnessed an explosion in the number of applications and successes of these models, ranging from image recognition and segmentation to text and speech understanding and generation. Our theoretical understanding of these models, however, remains limited in comparison to this leap in applications. Taking inspiration from classical theories of statistical and structured learning, this thesis develops, analyzes and applies a complexity measure for neural networks. Beyond taking a step towards a better understanding of the class of functions spanned by a neural network, such a measure can provide a well-founded regularizer as well as a metric for comparing two models. More precisely, we investigate the feasibility and potential of using a function norm as a complexity measure.

As a first result, we show that computing function norms of a DNN with three or more layers is intractable: by a reduction from the NP-complete Max-Cut problem, we prove that deciding whether a function norm is zero is NP-hard. However, norms computed with respect to a probability measure can be effectively approximated by a sampling-based approach. We investigate the use of this approximation as a regularizer, both theoretically and experimentally. On the theory side, we prove a generalization bound, under realistic constraints, for models that live in a subspace defined by the approximate norm. Experimentally, we test the regularizer on different models, standard benchmarks and a real-life application to medical images, confirming its effect on reducing overfitting.
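A minimal sketch of how such a sampling-based penalty could look in practice, assuming PyTorch models and using the training batch itself as the sample from the underlying measure; the function names and the reg_weight value are illustrative choices, not taken from the thesis:

import torch

def approx_sq_function_norm(outputs):
    # Monte Carlo estimate of ||f||_{L2(mu)}^2 = E_{x~mu} ||f(x)||^2,
    # given network outputs f(x_i) on samples x_i drawn from mu.
    return outputs.pow(2).sum(dim=1).mean()

def regularized_loss(model, inputs, targets, criterion, reg_weight=1e-3):
    # Task loss plus the sampled function-norm penalty; here mu is
    # approximated by the input batch, and reg_weight is a hypothetical
    # hyperparameter.
    outputs = model(inputs)
    return criterion(outputs, targets) + reg_weight * approx_sq_function_norm(outputs)

# Example usage with a small multilayer perceptron and cross-entropy loss:
model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 5))
criterion = torch.nn.CrossEntropyLoss()
inputs, targets = torch.randn(32, 20), torch.randint(0, 5, (32,))
loss = regularized_loss(model, inputs, targets, criterion)
loss.backward()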
Another potential use of a complexity measure is to compare two models, and the approximate norm we propose naturally defines a metric for this purpose. We analyze this metric in two settings: continual learning and model compression.

In continual learning, the goal is to learn from a continuous stream of data: the model receives a sequence of data or tasks and learns from them continually. The challenge is to preserve the accumulated knowledge while moving forward in the sequence and to overcome catastrophic forgetting. A possible solution is to constrain the training of the most recent task so as not to lose information that is crucial to previous tasks; such a constraint can be defined by means of a distance between two submodels. We propose a solution in which the crucial information of a task is captured by an autoencoder trained on the output of the early layers of the network. We then prevent the submodel obtained by stacking these early layers and the trained encoder from changing during training, by bounding the approximate distance between the submodel obtained at the end of training on the previous task and the one trained on the current task. On sequences of benchmarks, this method improves over a state-of-the-art approach.

Model compression is another important recent research direction. The rise of mobile and embedded machine learning applications, for instance in smartphones, self-driving cars and intelligent medical devices, has increased the interest in small and efficient models. In compression, the aim is to reduce the memory footprint of a model without significantly degrading its performance. The proposed approximate distance between the original and the compressed model provides a good proxy for this loss in performance. We propose a method in which this measure is used within a Bayesian optimization framework. Our results show that this approach, combined with a specialized Bayesian optimization routine, converges faster and reaches a better size/performance Pareto front than prior art.

Overall, this thesis uncovers a fundamental shortcoming in applying vanilla regularization methods inspired by statistical learning to DNN function spaces, but it nevertheless provides techniques to work around this limitation and to obtain complexity measures in such spaces. The use of such measures as a regularizer and as a distance is analyzed in multiple settings and on several benchmarks. The positive impact of these approaches shows that theoretical and practical studies of the DNN function space can work hand in hand to yield competitive improvements to machine learning systems.
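The approximate distance underlying both the continual-learning constraint and the compression proxy can be sampled in the same way as the norm. A minimal sketch, again assuming PyTorch models; compression_objective and size_weight are hypothetical illustrations of how the sampled distance might be combined with model size inside a Bayesian optimization loop, not the thesis' actual routine:

import torch

def approx_sq_distance(model_a, model_b, samples):
    # Monte Carlo estimate of the squared L2(mu) distance
    # d(f, g)^2 = E_{x~mu} ||f(x) - g(x)||^2 between two networks,
    # evaluated on samples x_i drawn from mu.
    return (model_a(samples) - model_b(samples)).pow(2).sum(dim=1).mean()

def compression_objective(compressed, original, samples, size_weight=1e-6):
    # Hypothetical scalar objective for a Bayesian optimization routine:
    # trade fidelity to the original model against parameter count.
    # size_weight is an illustrative trade-off value.
    with torch.no_grad():
        distance = approx_sq_distance(original, compressed, samples)
    num_params = sum(p.numel() for p in compressed.parameters())
    return distance.item() + size_weight * num_params

A Bayesian optimization routine (for example, scikit-optimize's gp_minimize) could then minimize such an objective over compression hyperparameters such as pruning ratios or layer widths.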
Publication year: 2020
Accessibility: Open