
Project

Cross-lingual and Cross-modal Fashion Product Linking and Search

The internet has dramatically changed the way people buy and sell products. It has caused a shift from physical to digital retail, which has only been accelerated by the recent COVID-19 pandemic. What attracts people to online shopping is the wider selection, better prices, and the convenience of shopping anywhere, anytime. However, online customers can also easily be overwhelmed by the wide range of products, which can hinder them from finding the relevant ones. Hence, applications that automatically organize, search, retrieve and recommend products are extremely valuable. Such applications require an in-depth understanding of the product attributes, which are expressed through different modalities, i.e., vision, language, video and audio. In this dissertation, we design deep neural networks to obtain intermodal and multimodal representations of fashion products from vision and language data that capture relevant fine-grained product attributes and that facilitate e-commerce search and recommendation.

Firstly, it is important that customers can search for products in a webshop by expressing the desired product attributes in a flexible way. Therefore, we develop a neural network architecture which learns intermodal representations for visual and textual product attributes that capture their semantic similarity. This neural network is trained with a loss function that incorporates similarity information of image regions to recover semantic similarity information lost due to the noise and incompleteness of the product descriptions. We evaluate the intermodal representations in cross-modal search and demonstrate that our model outperforms a state-of-the-art model on image search. Moreover, we show that the intermodal representations can be used for a novel multimodal fashion search paradigm where a target collection is searched with a query composed of both an image and text. We build a multimodal retrieval model which ranks images based on their relevance to a multimodal query and which allows us to control how much the query text changes the query image. We demonstrate that our model is able to retrieve images that both exhibit the necessary query image attributes and satisfy the query text. Furthermore, we show that our model substantially outperforms three state-of-the-art retrieval models adapted to multimodal fashion search.
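To make the multimodal search paradigm concrete, the following is a minimal sketch of ranking a target collection against a combined image-and-text query in a shared intermodal embedding space, with a mixing weight that controls how much the query text changes the query image. The encoders, the convex combination and the parameter `lam` are illustrative assumptions, not the exact retrieval model of the dissertation.

```python
# A minimal sketch, assuming hypothetical encoders that map the query
# image, the query text and the gallery images into a shared intermodal
# embedding space; `lam` controls how much the text modifies the image.
import torch
import torch.nn.functional as F

def rank_by_multimodal_query(query_image_emb, query_text_emb,
                             gallery_embs, lam=0.5):
    """Rank gallery images by similarity to a combined image+text query.

    query_image_emb: (d,) embedding of the query image
    query_text_emb:  (d,) embedding of the query text
    gallery_embs:    (n, d) embeddings of the target image collection
    lam:             weight of the textual modification (0 = image only)
    """
    # Combine the two modalities in the shared space (a simple convex
    # mix; the dissertation's actual composition module may differ).
    query = (1.0 - lam) * query_image_emb + lam * query_text_emb
    query = F.normalize(query, dim=-1)
    gallery = F.normalize(gallery_embs, dim=-1)

    # Cosine similarities and a descending ranking of gallery indices.
    scores = gallery @ query
    return torch.argsort(scores, descending=True)
```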

Secondly, we learn intermodal representations for visual and multilingual textual fashion attributes which enable the automatic translation of product descriptions. This is useful as e-retailers struggle to provide product descriptions in a variety of languages. We propose two variants of a multimodal neural machine translation model which improve the translation through visually grounded representations of source words referring to fine-grained product attributes embedded in a multimodal, multilingual space. We achieve state-of-the-art multimodal translation results on two real-world e-commerce datasets of fashion and furniture and on the Multi30K benchmark dataset.
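As an illustration of visually grounded source-word representations, the sketch below lets source-word embeddings attend over precomputed image-region features and adds the resulting visual context back to the words before they are passed to the translation decoder. The module and its dimensions are assumptions for exposition, not the proposed multimodal translation architecture.

```python
# Illustrative grounding module: source-word embeddings attend over
# image-region features and the attended visual context is fused back
# into the word representations before they are fed to the decoder.
import torch
import torch.nn as nn

class VisualGrounding(nn.Module):
    def __init__(self, word_dim, region_dim, joint_dim):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, joint_dim)
        self.region_proj = nn.Linear(region_dim, joint_dim)

    def forward(self, word_embs, region_feats):
        # word_embs: (src_len, word_dim), region_feats: (n_regions, region_dim)
        q = self.word_proj(word_embs)        # (src_len, joint_dim)
        k = self.region_proj(region_feats)   # (n_regions, joint_dim)
        attn = torch.softmax(q @ k.t() / k.size(-1) ** 0.5, dim=-1)
        visual_context = attn @ k            # per-word visual evidence
        # Visually grounded source representations for the translator.
        return q + visual_context
```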

While we learn intermodal representations of visual and textual attributes to facilitate cross-modal search, multimodal search and the translation of product descriptions, outfit recommendation requires multimodal representations of fashion items. As the visual and textual item data not only share product attributes but also contain complementary product attributes, there is a need to combine them effectively. Therefore, we perform a comparative study of fusion methods that combine relevant product attributes extracted from visual and textual data into multimodal item representations for outfit recommendation, where a variety of fashion product features play a role in creating outfits. We compare traditional fusion methods with attention-based fusion methods, which are designed to focus on fine-grained product attributes. We evaluate the fusion methods on four benchmark datasets and provide insights into the importance of the multimodality and granularity of the fashion item representations. We show that the average performance of attention-based fusion methods surpasses that of traditional fusion methods on three out of the four benchmark datasets, demonstrating the ability of attention to learn relevant correlations among fine-grained fashion attributes.
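The contrast between the two families of fusion methods can be illustrated as follows: a traditional method such as concatenation merges the modality vectors directly, while an attention-based method lets textual attribute features attend over visual region features before pooling. Both snippets are simplified assumptions rather than the exact models compared in the study.

```python
# Simplified fusion baselines: plain concatenation versus an
# attention-based fusion in which textual attribute features attend
# over visual region features; dimensions and names are assumptions.
import torch
import torch.nn as nn

def concat_fusion(visual, textual):
    """Traditional fusion: concatenate the two modality vectors."""
    return torch.cat([visual, textual], dim=-1)

class AttentionFusion(nn.Module):
    """Attention-based fusion focusing on fine-grained attributes."""
    def __init__(self, vis_dim, txt_dim, dim):
        super().__init__()
        self.q = nn.Linear(txt_dim, dim)
        self.k = nn.Linear(vis_dim, dim)
        self.v = nn.Linear(vis_dim, dim)

    def forward(self, region_feats, attr_feats):
        # region_feats: (n_regions, vis_dim), attr_feats: (n_attrs, txt_dim)
        q, k, v = self.q(attr_feats), self.k(region_feats), self.v(region_feats)
        attn = torch.softmax(q @ k.t() / q.size(-1) ** 0.5, dim=-1)
        fused = attn @ v              # attribute-aware visual features
        # Pool into a single multimodal item representation.
        return fused.mean(dim=0)
```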

Finally, there is an increasing interest in interpretable representations and explainable deep neural networks that can generate explanations for search and recommendation results. Therefore, we design a variational autoencoder framework which learns disentangled multimodal item representations that are interpretable. More precisely, we propose a variational autoencoder that jointly learns to disentangle fashion images and to align the disentangled representations with textual attributes. This alignment not only makes it possible to interpret the components of the disentangled space, but also provides a way to steer the variational autoencoder towards discovering relevant factors of variation, which is necessary for fashion e-commerce data, where the visual search space is huge and noisy. Furthermore, we demonstrate how to generate explanations for the coordinates of the disentangled representations based on their inferred alignments with textual attributes.
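A minimal sketch of this idea is given below, assuming an image encoder and decoder and a single categorical textual attribute per item: a standard VAE objective is extended with an alignment term that encourages the latent code to predict the textual attribute, so that latent coordinates become interpretable. The loss weights and the linear attribute head are illustrative assumptions, not the proposed framework.

```python
# Sketch of a VAE whose latent code is aligned with a textual attribute;
# the encoder/decoder interfaces, the linear attribute head and the
# loss weights `beta` and `gamma` are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeAlignedVAE(nn.Module):
    def __init__(self, encoder, decoder, latent_dim, n_attribute_values):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        # A linear head over the latent code predicts the textual
        # attribute, tying interpretable factors to latent coordinates.
        self.attr_head = nn.Linear(latent_dim, n_attribute_values)

    def forward(self, images, attr_labels, beta=4.0, gamma=1.0):
        mu, logvar = self.encoder(images)                      # (b, latent_dim) each
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        recon = self.decoder(z)

        recon_loss = F.mse_loss(recon, images, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        # Alignment term: the latent code must predict the textual
        # attribute, steering disentanglement towards relevant factors.
        align = F.cross_entropy(self.attr_head(z), attr_labels)
        return recon_loss + beta * kl + gamma * align
```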

Date: 13 Dec 2016 → 31 Mar 2022
Keywords: Cross-modal search, Multimodal search, Cross-lingual linking
Disciplines: Applied mathematics in specific fields, Computer architecture and networks, Distributed computing, Information sciences, Information systems, Programming languages, Scientific computing, Theoretical computer science, Visual computing, Other information and computing sciences
Project type: PhD project