Neural Networks in a 3D World: Learning from Geometry
With the rise of deep learning, computer vision research has become largely dominated by machine learning for many applications. A large group of these applications stems from the robotics community, for instance the use of computer vision in autonomous driving. Reasoning about 3D environments is still difficult for a deep learning model, which causes problems in some computer vision fields such as camera localization. Hand-crafted geometric algorithms are still popular in those fields, though monocular cameras can constrain their effectiveness.
This thesis explores how ideas from those hand-crafted algorithms and deep learning can be combined. The goal is to find solutions that combine the strengths of both fields. Hand-crafted algorithms tend to require little, if any, training data and generalize better. Deep learning, on the other hand, requires a large dataset to train the model, but can learn relations that are difficult to model explicitly. In particular, we look at four different problems, all focussed on applications using only a monocular camera.
We show that we can use a convolutional neural network (CNN) to control a quadrotor drone to avoid obstacles by training the network to estimate depth from a single image. We also show that the model suffers from generalization issues when applied to an environment that differs from its training data.
Improving this model would normally require the acquisition of more training data. We show that it is possible to improve the initial model using only monocular video footage, which is much easier to collect. This unsupervised improvement relies on a SLAM algorithm, which would normally suffer from scale ambiguity. We resolve this scale ambiguity using the initial depth estimation model.
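To illustrate the scale-recovery idea, the following is a minimal sketch and not the method used in the thesis. It assumes that a monocular SLAM system provides depths at sparse pixels up to an unknown global scale, that the CNN provides metric depth estimates at the same pixels, and that a median of per-pixel ratios is an acceptable robust estimator. The function name estimate_scale and its parameters are hypothetical.

```python
import numpy as np


def estimate_scale(slam_depth, cnn_depth, eps=1e-6):
    """Return s such that s * slam_depth roughly matches cnn_depth.

    Assumption for illustration: both inputs hold depths for the same
    pixels; slam_depth is only known up to scale, cnn_depth is metric.
    """
    slam_depth = np.asarray(slam_depth, dtype=float)
    cnn_depth = np.asarray(cnn_depth, dtype=float)

    # Ignore pixels where either source has no usable depth.
    valid = (slam_depth > eps) & (cnn_depth > eps)
    if not np.any(valid):
        raise ValueError("no valid depth pairs to estimate the scale from")

    # The median ratio tolerates a minority of bad CNN predictions.
    return float(np.median(cnn_depth[valid] / slam_depth[valid]))
```

Multiplying the SLAM depths and camera translations by the returned factor would bring them to the metric scale implied by the depth network, under the assumptions above.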
We localize a camera in a known environment using both deep learning and geometric relations between 3D points. When localizing the camera, it is impractical to rely fully on either a geometry-based solution or an end-to-end deep learning solution. We propose a method that relies on a CNN to roughly localize the camera based on visual features and then uses a keypoint detector to refine the estimate to an accurate camera position.
Finally, we show that we can predict the 3D motion of visible vehicles without any ground truth information. We use a standard method called Iterative Closest Point (ICP) to align point clouds of objects between two frames, which serves as a supervision signal for a CNN. The CNN learns to predict the 3D motion of objects from a sequence of RGB frames, requiring no expensive depth sensors during inference.
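As a rough illustration of how ICP yields a motion estimate between two frames, here is a minimal point-to-point sketch. It is not the implementation used in the thesis: it assumes small motion, brute-force nearest-neighbour matching, and no outlier rejection, and the function names best_rigid_transform and icp are hypothetical.

```python
import numpy as np


def best_rigid_transform(src, dst):
    """Least-squares R, t with R @ src[i] + t ~= dst[i] (Kabsch method)."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_mean).T @ (dst - dst_mean)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t


def icp(src, dst, iterations=20, tol=1e-9):
    """Align point cloud src (N x 3) to dst (M x 3); return R, t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    R_total, t_total = np.eye(3), np.zeros(3)
    moved = src.copy()
    prev_err = np.inf
    for _ in range(iterations):
        # Match every moved source point to its closest target point.
        d2 = ((moved[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        nn = d2.argmin(axis=1)
        # Solve for the rigid step that best aligns the current matches.
        R, t = best_rigid_transform(moved, dst[nn])
        moved = moved @ R.T + t
        # Compose the step with the transform accumulated so far.
        R_total, t_total = R @ R_total, R @ t_total + t
        err = np.mean(np.linalg.norm(moved - dst[nn], axis=1))
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R_total, t_total
```

Applied to the point clouds of one vehicle in two consecutive frames, the returned rotation and translation describe the rigid motion of that vehicle between the frames, which is the kind of quantity that could serve as a training target.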