
Project

Joint Sound Source Segregation and Identification

Many speech technology applications expect speech input from a single speaker and usually fail when multiple speakers are active, especially when their speech overlaps. However, in many situations there are multiple people within reach of the recording device, so there is a high chance of multiple active speakers. In Source Separation (SS) the different speech sources are separated to obtain a signal for each speaker. As we separate the sources, we would also like to know the identity or the characteristics of each speaker through Speaker Recognition (SR), so that the speaker can be re-identified later on. If both SS and SR are applied, a single speaker can be tracked throughout, for example, a recording of a business meeting. SS and SR are usually treated as separate problems. However, when blindly separating a speech mixture, characterization of the sources is inherently necessary. Moreover, when recognizing speakers in overlapping speech, every speaker is associated with part of the audio fragment and thus source separation is implicitly performed. The main hypothesis of the PhD thesis is that when both are done jointly, they constructively help each other to achieve greater performance. A sequential approach, in which SS is performed first and SR afterwards, is less effective because each step is optimized independently and neglects the other.

Chapters 2 and 4 of the thesis look for such joint models and indeed find that they outperform sequential approaches. In Chapter 2 this is done using Nonnegative Matrix Factorization (NMF), while in Chapter 4 Deep Neural Networks (DNNs) are used. This suggests that the advantage of the joint models over the sequential ones is not due to the particular model choice for SS and SR, but rather intrinsic to the tasks themselves. Chapters 3 and 6 do not explicitly search for a joint model, but instead show that speaker-characteristic information is indeed inherently required when performing SS. Finally, Chapters 4 and 7 consider more applied and practical aspects of SS, but in both the link with SR is made.
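To illustrate why separation and speaker characterization are intertwined in the NMF setting mentioned above, the following is a minimal sketch (in Python with NumPy), not the implementation used in the thesis: speaker-specific dictionaries W_a and W_b are assumed to be pre-trained on clean speech of each speaker, and fitting their activations on the mixture spectrogram yields per-speaker estimates. The function names and parameters are hypothetical.

# Minimal sketch of supervised NMF-based separation of a two-speaker mixture
# (assumed setup, not the thesis implementation).
import numpy as np

def nmf_activations(V, W, n_iter=200, eps=1e-10):
    # Estimate nonnegative activations H such that V ~ W @ H, with W fixed,
    # using multiplicative updates that minimize the Frobenius norm.
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)
    return H

def separate(V_mix, W_a, W_b, eps=1e-10):
    # Split a mixture magnitude spectrogram V_mix into per-speaker estimates
    # by soft-masking with each speaker's partial reconstruction.
    W = np.hstack([W_a, W_b])        # joint dictionary of both speakers
    H = nmf_activations(V_mix, W)
    Ka = W_a.shape[1]
    V_a = W_a @ H[:Ka]               # reconstruction attributed to speaker A
    V_b = W_b @ H[Ka:]               # reconstruction attributed to speaker B
    total = V_a + V_b + eps
    return V_mix * V_a / total, V_mix * V_b / total   # Wiener-style masks

# Hypothetical usage: V_mix is an (n_freq x n_frames) magnitude spectrogram,
# W_a and W_b are (n_freq x K) dictionaries trained per speaker.
# S_a, S_b = separate(V_mix, W_a, W_b)

Because the dictionaries encode who the speakers are, the separation step in this sketch cannot be carried out without speaker characterization, which is the intuition behind the joint SS/SR hypothesis.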

Date: 18 Sep 2015 → 29 Sep 2020
Keywords: Machine Learning, Nonnegative Matrix Factorization, Deep Learning, Speech Processing, Source Separation, Speaker Recognition
Disciplines: Audio and speech computing, Pattern recognition and neural networks
Project type: PhD project