
Publication

Tractable Approximations for Achieving Higher Model Efficiency in Computer Vision

Book - Dissertation

The 2010s have seen the first large-scale successes of computer vision "in the wild", paving the way for industrial applications. Thanks to the formidable increase of processing power in consumer electronics, convolutional neural networks have led the way in this revolution. With enough supervision, these models have proven able to surpass human accuracy on many vision tasks. However, rather than focusing exclusively on accuracy, it is increasingly important to design algorithms that operate within the bounds of a computational budget, be it in terms of latency, memory, or energy consumption. The adoption of vision algorithms in time-critical decision systems (such as autonomous driving) and in edge computing (e.g. in smartphones) makes this quest for efficiency a central challenge in machine learning research. How can the optimization of existing models be improved, in order to reach higher accuracy without affecting the processing requirements? Alternatively, can we search for models that fit the processing requirements while improving the accuracy on the task? In this thesis, we consider both of these questions, which are two sides of the same coin. On the one hand, we develop novel methods for learning model parameters in a supervised fashion, improving the accuracy on the target task without affecting the efficiency of these models at test-time. On the other, we study the problem of model search, where the model itself must be selected among a family of models in order to achieve satisfactory accuracy under the resource constraints.

Chapter 3 introduces the probably submodular framework for learning the weights of pairwise graphical models. Graphical models are expressive and popular, used notably in semantic segmentation, but their inference is NP-hard in general. In order to guarantee efficient inference, it is necessary to constrain the weights learned during training. Popular tractability constraints are the definitely submodular constraints: they ensure that the local potential functions of the model are submodular for any input at test-time. We show that these constraints are often too conservative. Rather than enforcing that the graphical model is submodular for any input graph, it is sufficient to ensure submodularity with high probability under the data distribution of the task. We demonstrate the superiority of this approach on several semantic segmentation and multi-label classification datasets, validating the corresponding gain in model expressivity and accuracy without compromising efficient inference at test-time.
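As a toy illustration of the difference between these two constraint families, the following Python sketch (hypothetical names, not taken from the thesis) estimates how often learned pairwise potentials satisfy the binary submodularity inequality theta(0,0) + theta(1,1) <= theta(0,1) + theta(1,0) over edges sampled from training data; the probably submodular framework only requires this inequality to hold with high probability under the data distribution, rather than for every possible input.

    import numpy as np

    def is_submodular(theta):
        # Binary pairwise potential theta[a, b] for labels a, b in {0, 1}.
        # Submodularity (graph-cut solvability) requires:
        # theta(0,0) + theta(1,1) <= theta(0,1) + theta(1,0).
        return theta[0, 0] + theta[1, 1] <= theta[0, 1] + theta[1, 0]

    def empirical_submodularity(pairwise_potentials):
        # pairwise_potentials: array of shape (n_edges, 2, 2) holding the
        # potentials produced by the learned weights on sampled edges.
        ok = np.array([is_submodular(theta) for theta in pairwise_potentials])
        return ok.mean()

    # A "definitely submodular" constraint rejects any weights for which this
    # fraction could fall below 1 on some input; the probabilistic view only
    # asks that it stay high on edges drawn from the task distribution.
    rng = np.random.default_rng(0)
    sampled = rng.normal(size=(1000, 2, 2))
    print(f"fraction of submodular edges: {empirical_submodularity(sampled):.2f}")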
Chapter 4 presents improved optimization methods that reduce the test-time error of semantic segmentation models by introducing novel task-specific losses. In recent years, convolutional neural networks have dominated the state of the art in semantic segmentation. These networks are usually trained with a cross-entropy loss, which is easy to use within first-order optimization schemes. However, segmentation benchmarks are usually evaluated under other metrics, such as the intersection-over-union measure, or Jaccard index. A direct optimization of this measure, while challenging, can yield a lower error rate. Such gains are relevant to applications, as the Jaccard index has been shown to be closer to human perception and benefits from scale-invariance properties. Using the Lovász extension of submodular set functions, we develop tractable surrogates for the optimization of the Jaccard index in the binary and multi-label settings, compatible with first-order optimizers. We demonstrate the gains of our method in terms of the target metric on binary and multi-label semantic segmentation problems, using state-of-the-art convolutional networks on the Pascal VOC and Cityscapes datasets.

Chapter 5 considers the problem of neural architecture search, where one wants to select the best-performing model satisfying the computational requirements among a large search space. We aim to adjust the channel numbers of a given neural network architecture, i.e. the number of convolutional filters in each layer of the model. We first develop a latency model that predicts the latency of a network from its channel numbers, fitted by least-squares estimation without requiring access to low-level details of the computation on the inference engine. We then build a proxy for the model error that decomposes additively over the individual channel choices, using aggregated training statistics of a slimmable model on the same search space. Combining the pairwise latency model with the unary error estimates yields an objective that can be optimized efficiently with the Viterbi algorithm, resulting in the OWS method. A refinement of OWS, named AOWS, adaptively restricts the search space towards optimal channel configurations during the training of the slimmable network. We validate our approach across several inference modalities and show improved final performance of the selected models within given computational budgets.

Overall, this thesis proposes novel methods for improving the accuracy/efficiency trade-off of contemporary machine learning models, derived from first principles and validated experimentally on several computer vision problems. This research paves the way towards a smarter usage of the computational resources of machine learning methods, curbing the trend towards ever "wider and deeper" models in order to meet the challenges of time-critical and carbon-neutral AI.
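To make the binary construction of Chapter 4 more concrete, the following sketch outlines a Lovász-extension surrogate for the Jaccard loss (a simplified PyTorch-style reading of the idea described above, not the thesis implementation): per-pixel hinge errors are sorted in decreasing order and weighted by the discrete derivative of the Jaccard set function along that ordering, giving a piecewise-linear surrogate that first-order optimizers can handle.

    import torch

    def lovasz_grad(gt_sorted):
        # Discrete gradient of the Jaccard loss along the sorted error ordering.
        gts = gt_sorted.sum()
        intersection = gts - gt_sorted.cumsum(0)
        union = gts + (1.0 - gt_sorted).cumsum(0)
        jaccard = 1.0 - intersection / union
        if gt_sorted.numel() > 1:
            jaccard[1:] = jaccard[1:] - jaccard[:-1]
        return jaccard

    def lovasz_hinge_flat(logits, labels):
        # logits: raw foreground scores per pixel; labels: {0, 1} ground truth.
        signs = 2.0 * labels.float() - 1.0
        errors = 1.0 - logits * signs                  # per-pixel hinge errors
        errors_sorted, perm = torch.sort(errors, descending=True)
        grad = lovasz_grad(labels.float()[perm])
        return torch.dot(torch.relu(errors_sorted), grad)

    # Usage: average the surrogate over the images of a batch and backpropagate.
    logits = torch.randn(16, requires_grad=True)
    labels = (torch.rand(16) > 0.5).long()
    loss = lovasz_hinge_flat(logits, labels)
    loss.backward()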
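Similarly, the channel search of Chapter 5 can be pictured as dynamic programming on a chain: each layer's channel choice contributes a unary error term, each pair of consecutive choices a latency term, and the best configuration is recovered with a Viterbi-style recursion. The sketch below uses made-up numbers and a simple penalty weight lam to fold the latency into the objective, a simplification of the budgeted formulation summarized above.

    import numpy as np

    def viterbi_channel_search(unary_error, pairwise_latency, lam=1.0):
        # unary_error[l][c]: error proxy of picking channel option c in layer l.
        # pairwise_latency[l][p][c]: latency contribution of layer l when layer
        # l-1 uses option p and layer l uses option c; lam trades error vs latency.
        n_layers = len(unary_error)
        cost = np.asarray(unary_error[0], dtype=float)   # best cost per option of layer 0
        back = []                                        # backpointers per layer
        for l in range(1, n_layers):
            total = (cost[:, None]
                     + lam * np.asarray(pairwise_latency[l], dtype=float)
                     + np.asarray(unary_error[l], dtype=float)[None, :])
            back.append(total.argmin(axis=0))
            cost = total.min(axis=0)
        # Backtrack the best chain of channel choices.
        choices = [int(cost.argmin())]
        for bp in reversed(back):
            choices.append(int(bp[choices[-1]]))
        return list(reversed(choices)), float(cost.min())

    # Toy example: 3 layers with 2 channel options each.
    unary = [[0.30, 0.25], [0.20, 0.10], [0.15, 0.05]]
    latency = [None,
               [[1.0, 2.0], [2.0, 4.0]],
               [[1.0, 2.0], [2.0, 4.0]]]
    print(viterbi_channel_search(unary, latency, lam=0.05))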
Year of publication: 2020
Accessibility: Open