Deep Learning Methods for Multimodal Representation Learning
Book - Dissertation
An interesting challenge in current deep learning research is processing different modalities. Of particular interest are language and vision, which make up a sizable portion of textbooks, websites, medical data, and other important media. However, these modalities have very different characteristics. Human communication uses discrete symbols, while visual information is more easily expressed by continuous ranges of hue and brightness. Artificial neural networks typically employ continuous-valued vectors to represent information, which leans closer to the fabric of vision than to the symbol-based nature of language. In neural network training, textual information is therefore usually transformed into a continuous-valued embedding that accommodates gradient-based update methods.

In this dissertation, I investigate ways to bridge the gap between language and vision in neural networks. I investigate how continuous language vectors can be integrated into several useful settings. Additionally, I suggest several improvements and research directions for current neural network representations.

Concretely, I first research how mapping natural language sentences to continuous embeddings can be useful for several applications. I show that such embeddings can be evaluated by generating images from them. Additionally, such techniques can help justify diagnosis outcomes with neural networks: I demonstrate a methodology to generate a counterfactual example for a particular diagnosis.

I draw inspiration from human communication, which relies on discrete symbols, to suggest possible improvements for current neural network representations. Instead of relying only on continuous embeddings, I suggest either adding discrete symbols or additional structure. I argue that this approach will lead to better generalization and reasoning in neural network models. I illustrate this with two concrete applications.
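The transformation of discrete symbols into continuous embeddings mentioned above can be sketched minimally. The toy vocabulary, embedding size, and mean-pooling choice below are illustrative assumptions for a common baseline, not the models developed in the dissertation:

```python
import numpy as np

# Illustrative only: a toy vocabulary and a random embedding table.
# In practice such embeddings are learned with gradient-based training.
rng = np.random.default_rng(0)
vocab = {"a": 0, "red": 1, "bird": 2, "on": 3, "branch": 4}
embedding_table = rng.normal(size=(len(vocab), 8))  # one 8-d vector per symbol

def sentence_embedding(sentence: str) -> np.ndarray:
    """Map a sentence of discrete symbols to a single continuous vector
    by looking up each word and mean-pooling (a simple baseline)."""
    ids = [vocab[w] for w in sentence.lower().split() if w in vocab]
    return embedding_table[ids].mean(axis=0)

vec = sentence_embedding("a red bird on a branch")
print(vec.shape)  # one continuous vector that gradient updates can act on
```

The point of the sketch is only the bridge itself: once the discrete tokens are replaced by a continuous vector, the sentence lives in the same kind of space as visual features and can be consumed by standard gradient-based methods.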
In the first application, I design a novel neural network method to learn representations that are informed by different modalities. These representations combine desirable properties of both discrete and continuous representations; they are learned by estimating distances between class and contextual distributions. The second application enforces a pose-independent 2D representation with a direct link to 3D information. These representations are semantically aligned, so that certain parts of an object are always mapped to the same location in the representation. I show how this property can be used in a generative 3D model where language controls visual characteristics such as shape and color.

The main conclusion of the thesis is that, while current neural network representations have interesting qualities for bridging language and vision, the inclusion of discrete components can lead to desirable improvements. By imposing more structure or adding discrete symbols, neural networks can achieve more control and compositionality. I demonstrate this in two concrete settings and open several interesting directions for future research.