Project

Making Sense of Sound

Speech processing is a highly developed field, both scientifically and technologically. The special attention given to speech is clearly deserved, since it is one of the most important means of communication between humans. At the same time, other sounds also carry meaning relevant to everyday life: a door being opened, footsteps, a car approaching, a baby screaming, an alarm bell ringing, etc. However, the domain of environmental sound recognition has only recently taken off and is therefore much less mature. The goal of this thesis is to advance the state of the art in this field by broadening the scope of the research activities and addressing some of the limitations of the current paradigms.

A clear opportunity for enhancing current sound recognition systems lies in using additional modalities alongside audio. In particular, visual information could improve sound event predictions. This strategy also makes sense in light of potential applications: humans rely heavily on both auditory and visual cues to achieve an accurate and complete grasp of their surroundings. Consequently, in order to create situation-aware machines, it is essential to investigate models that can also successfully use this additional source of information to interpret environmental events. In some cases, textual data might also be useful in this regard. In this project, the main goal is to examine the integration of these extra modalities into sound recognition models under various circumstances. We show that applying transfer learning techniques to incorporate vision-related features can lead to better outcomes in a number of scenarios, both for traditional artificial neural network architectures and for more novel attention-based deep learning systems such as transformers.

Date: 21 Sep 2017 → 28 Apr 2023
Keywords: Cross-modal representations, Weakly annotated sound mixtures
Disciplines: Nanotechnology, Design theories and methods
Project type: PhD project