Deep learning for multimodal representation learning
There are many tasks that require multimodal representations: visual question answering, cross-modal retrieval, and phrase grounding, among many others. These domains are united by the need for methods that can project all modalities into a joint, shared latent space. This space should be structured, and should capture the correspondences between the signals that generated the data. In this thesis, we aim to explore and improve methods that learn such multimodal representations. In particular, our goals are to improve unimodal representation learning methods, to find better ways of fusing unimodal representations into a multimodal representation, and to develop methods that jointly learn multimodal representations from multiple data streams.
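To make the notion of a joint latent space concrete, the following is a minimal sketch of projecting two modalities into a shared space and scoring their alignment with a contrastive objective. All dimensions, the random "features", and the InfoNCE-style loss are illustrative assumptions, not the specific methods developed in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unimodal features for 4 paired image/caption examples
# (stand-ins for, e.g., CNN outputs and word-embedding averages).
image_feats = rng.normal(size=(4, 512))
text_feats = rng.normal(size=(4, 300))

# Learnable linear projections into a shared 128-d latent space.
W_img = rng.normal(size=(512, 128)) * 0.01
W_txt = rng.normal(size=(300, 128)) * 0.01

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

z_img = l2_normalize(image_feats @ W_img)
z_txt = l2_normalize(text_feats @ W_txt)

# Similarity of every image against every caption; the diagonal
# holds the matched (corresponding) pairs.
sim = z_img @ z_txt.T  # shape (4, 4)

def info_nce(sim, temperature=0.1):
    """Contrastive loss: matched pairs should outscore mismatched ones per row."""
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

loss = info_nce(sim)
print(sim.shape, float(loss))
```

Minimizing such a loss over the projection weights pulls corresponding signals from different modalities together in the shared space while pushing mismatched pairs apart, which is one common way the desired structure is imposed.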