
Project

Capsule Networks for Automatic Speech Recognition

Deep neural networks (DNNs) have caused a tremendous revolution in many aspects of artificial intelligence (AI), including speech processing, natural language processing and image processing. DNNs are known to be data-hungry, i.e. they require large amounts of labeled training data, which is expensive to acquire in most applications. In speech recognition, this means that we need to know the orthographic transcription of each utterance (i.e. someone needs to write down what they hear). In the last few years, progress has been made in unsupervised learning, such that smaller amounts of annotated data can be augmented with unannotated data, which is much easier to obtain (e.g. recordings of people talking).

One of the reasons why DNNs need so much training data is that they “learn away” variation in the data. In visual object recognition, different inputs originating from different illuminations, poses and view angles of the same object are all mapped onto the same classes, often using many convolutional neural network (CNN) layers. The CNN kernels trigger on patterns of increasing complexity as we move deeper into the network: the lowest layers may trigger on properties like specific line orientations, while higher layers trigger on shapes composed of these lines. However, the network needs to learn that different poses and view angles lead to line segments of a different orientation in the lowest layers, and that the positions of these segments change. The higher-level layers then need to map all of this variation onto the same object categories, i.e. they have to “learn away” the variation.

A similar issue of data variation is present in speech recognition: different speakers have different voice characteristics, reflected among other things in different formant frequencies, yet the phoneme classes need to remain the same. Background noise or competing speakers corrupt the low-level features, and different speakers may use different grammatical constructions to express the same ideas. The methods explored in this proposal take the opposite approach: instead of “learning away” variation, they propagate it to the higher layers.
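
As a concrete illustration of propagating variation upward, the sketch below shows the two core operations of a capsule layer as introduced by Sabour, Frosst and Hinton (2017): the "squash" nonlinearity and routing-by-agreement. Each capsule outputs a vector whose length signals the presence of an entity and whose orientation carries its instantiation parameters, so variation is passed on to higher layers rather than discarded. This is a minimal NumPy sketch for illustration only; the function names and toy dimensions are assumptions, not this project's implementation.

    # Illustrative sketch of capsule routing (Sabour et al., 2017);
    # not this project's code.
    import numpy as np

    def squash(s, axis=-1, eps=1e-8):
        """Shrink vector length into [0, 1) while preserving orientation,
        so length can be read as the probability that the entity is present."""
        sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
        scale = sq_norm / (1.0 + sq_norm) / np.sqrt(sq_norm + eps)
        return scale * s

    def routing_by_agreement(u_hat, n_iter=3):
        """Dynamic routing between capsule layers.
        u_hat: predictions from lower capsules for each higher capsule,
               shape (n_lower, n_higher, dim_higher).
        Returns higher-capsule output vectors, shape (n_higher, dim_higher)."""
        n_lower, n_higher, _ = u_hat.shape
        b = np.zeros((n_lower, n_higher))              # routing logits
        for _ in range(n_iter):
            # Coupling coefficients: softmax over the higher capsules.
            c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
            s = np.einsum('ij,ijk->jk', c, u_hat)      # weighted sum of predictions
            v = squash(s)                              # higher-capsule outputs
            b += np.einsum('ijk,jk->ij', u_hat, v)     # raise logits where predictions agree
        return v

    # Toy usage: 8 lower capsules voting for 4 higher capsules of dimension 16.
    u_hat = np.random.randn(8, 4, 16)
    v = routing_by_agreement(u_hat)
    print(np.linalg.norm(v, axis=-1))                  # lengths in [0, 1): presence probabilities

Because the higher-capsule vectors retain orientation information from the agreeing lower capsules, properties such as pose (in vision) or speaker-dependent characteristics (in speech) remain available to subsequent layers instead of being averaged out.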

Date: 23 Sep 2019 → 23 Sep 2023
Keywords: Automatic Speech Recognition
Disciplines: Audio and speech processing
Project type: PhD project