
Project

Noise Robust Automatic Speech Recognition Based on Spectro-Temporal Techniques

Although noise robust automatic speech recognition (ASR) has been a topic of intensive research, to date it still cannot match the corresponding human performance. This has motivated many researchers to search for ways to improve the robustness of automatic speech recognition based on human speech perception. One popular way of achieving this is spectro-temporal processing. Spectro-temporal processing, inspired by studies of the receptive fields of auditory neurons, aims to capture the spectral and temporal modulations of the signal simultaneously. In this study we examine several methods based on this technique.

First, we examined two fundamental approaches of spectro-temporal processing, namely the extraction of 2D DCT coefficients and the use of Gabor filters. In our experiments with 2D DCT coefficients we first attempted to find optimal parameter values for spectro-temporal feature extraction. After these initial experiments we compared our 2D DCT features with MFCCs in experiments conducted on the TIMIT speech corpus. We demonstrated that using 2D DCT features we can get similar or even better results than those obtained using MFCCs. When working with Gabor filters, we first examined several filter selection methods. Then we compared the performance of the resulting filter sets with that of filter sets described in the speech recognition literature. Here, we found that the filter set we created based on simple heuristics gave the best performance score, outperforming the other filter sets as well as the MFCCs on both clean and noise-contaminated speech.
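As a rough illustration of the 2D DCT approach, the sketch below extracts low-order 2D DCT coefficients from local patches of a log mel spectrogram. The patch sizes and the number of retained coefficients are placeholder values, not the tuned parameters found in the study:

```python
import numpy as np
from scipy.fft import dct

def spectro_temporal_dct(spectrogram, patch_freq=9, patch_time=9, n_coeffs=3):
    """Extract 2D DCT coefficients from local spectro-temporal patches.

    spectrogram: (n_freq, n_frames) log mel spectrogram.
    Returns one feature vector per frame position where a full patch fits.
    Patch sizes and n_coeffs are illustrative placeholders.
    """
    n_freq, n_frames = spectrogram.shape
    features = []
    for t in range(n_frames - patch_time + 1):
        frame_feats = []
        # Slide over the frequency axis in non-overlapping steps.
        for f in range(0, n_freq - patch_freq + 1, patch_freq):
            patch = spectrogram[f:f + patch_freq, t:t + patch_time]
            # 2D DCT: apply the type-II DCT along both axes.
            coeffs = dct(dct(patch, axis=0, norm='ortho'), axis=1, norm='ortho')
            # Keep only the low-order (slow) modulation coefficients.
            frame_feats.append(coeffs[:n_coeffs, :n_coeffs].ravel())
        features.append(np.concatenate(frame_feats))
    return np.array(features)
```

Keeping only the upper-left corner of the coefficient matrix retains the slow spectral and temporal modulations, which is the usual motivation for 2D DCT features.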

Then, motivated by the obstacles that arose in automatic feature selection, we introduced a method for joint training. This framework works by integrating the feature extraction step into the lowest layer of a neural net, effectively combining the stages of secondary feature extraction and neural net training. This combination consistently led to lower phone error rates in our experiments compared to the error rates attained when the two stages were carried out separately. Furthermore, when adding modifications to this joint training model based on recent advances in neural network research, we managed to further reduce the resulting error rates, demonstrating the framework's capability for improvement. These experiments also demonstrated that training the initial filter coefficients is useful; even so, it proved advantageous to begin this process with a good initial filter set.
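The core idea of joint training can be sketched as a network whose lowest layer is initialised with the spectro-temporal filter coefficients and then updated by backpropagation together with the upper layers. The filter count, patch size, and learning rate below are illustrative assumptions, not the configuration used in the study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical initial spectro-temporal filters: each column is one
# flattened 2D filter applied to a (freq x time) spectrogram patch.
n_filters, patch_size = 8, 81                    # e.g. 9x9 patches, 8 filters
init_filters = rng.standard_normal((patch_size, n_filters)) * 0.1

# Joint model: the filter bank forms the lowest layer of the network,
# so its coefficients receive gradients just like the upper-layer weights.
W1 = init_filters.copy()                         # trainable filter coefficients
W2 = rng.standard_normal((n_filters, 3)) * 0.1   # upper layer (3 toy classes)

def forward(X):
    h = np.maximum(X @ W1, 0.0)                  # filter responses + ReLU
    logits = h @ W2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, e / e.sum(axis=1, keepdims=True)   # softmax posteriors

def train_step(X, y, lr=0.1):
    global W1, W2
    h, p = forward(X)
    g = p.copy()
    g[np.arange(len(y)), y] -= 1.0               # softmax cross-entropy gradient
    gW2 = h.T @ g / len(y)
    gh = (g @ W2.T) * (h > 0)                    # backprop through ReLU
    gW1 = X.T @ gh / len(y)                      # the gradient reaches the filters
    W2 -= lr * gW2
    W1 -= lr * gW1
```

After a few training steps the filter coefficients in `W1` have moved away from their initialisation, which is exactly the behaviour the abstract refers to: training the filters helps, while a good initial filter set still matters.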

We also examined the combination of spectro-temporal and multi-band processing, motivated by the compatibility of these approaches. First, we demonstrated the viability of this combination on the TIMIT database for both clean and noise-contaminated speech. Then, by introducing Deep Neural Networks and the technique of convolution into this framework, we decreased the error rates still further. We also successfully incorporated the multi-band approach into our joint training framework. When evaluating the resulting method on the clean training scenario of the Aurora-4 speech recognition task, we attained error rates that were, at the time of their publication, among the lowest published for the given task.
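The multi-band idea itself is simple to sketch: the spectrum is split into contiguous frequency bands, each band is processed by its own stream, and the per-band outputs are merged. The band count and the toy "subnetworks" below are assumptions for illustration only:

```python
import numpy as np

def multi_band_features(spectrogram, band_nets):
    """Split a (n_freq, n_frames) spectrogram into contiguous frequency
    bands, process each band with its own (sub)network, and merge the
    per-band outputs. Noise that corrupts one band then degrades only
    that stream, which is the core motivation of multi-band processing.
    """
    bands = np.array_split(spectrogram, len(band_nets), axis=0)
    return np.concatenate([net(band) for net, band in zip(band_nets, bands)])

# Toy usage: four stand-in "subnetworks" that just average their band
# per frame (a real system would use a neural network per stream).
nets = [lambda band: band.mean(axis=0) for _ in range(4)]
spec = np.ones((40, 10))
feats = multi_band_features(spec, nets)   # one merged feature vector
```

In the actual framework the merge step feeds a further network that combines the band streams; here concatenation stands in for that combination stage.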

Lastly, after the parameters of the joint training framework had been re-examined and modified, we supplemented our joint training technique with a method inspired by input dropout and multi-band processing. Here, the input dropout was applied in such a way that whole frequency bands were ignored for complete batches. With this method, similar to multi-band processing, we strove to improve the robustness of the trained model by forcing the network to rely less on the whole spectrum. We evaluated this method on the Aurora-4 database, using both mel-spectral features and ARMA features. Our results indicated that in the clean training scenario, band dropout significantly improved the results compared to those obtained using no dropout or standard input dropout. What is more, when used in conjunction with ARMA features, the band dropout method produced significantly better results than those listed earlier, giving a performance score that is among the best reported for the given task.
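A minimal sketch of the band dropout idea follows. The band count, drop probability, and the absence of any rescaling are assumptions for illustration, not the exact scheme used in the study:

```python
import numpy as np

def band_dropout(batch, n_bands=8, p_drop=0.25, rng=None):
    """Zero out whole frequency bands for an entire batch.

    batch: (batch_size, n_freq, n_frames) spectrogram inputs.
    Unlike standard input dropout, which zeroes individual input units
    independently, entire contiguous frequency bands are removed for the
    whole batch, so the network cannot rely on any single spectral region.
    """
    rng = rng or np.random.default_rng()
    out = batch.copy()
    n_freq = batch.shape[1]
    # Band edges: n_bands contiguous, roughly equal frequency ranges.
    edges = np.linspace(0, n_freq, n_bands + 1).astype(int)
    for b in range(n_bands):
        if rng.random() < p_drop:
            out[:, edges[b]:edges[b + 1], :] = 0.0
    return out
```

At test time the layer is simply skipped, as with ordinary dropout; during training each batch sees a spectrum with a random subset of bands missing.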

Date: 16 Apr 2012 → 29 Mar 2018
Keywords: Automatic Speech Recognition, Noise Robust Speech Recognition, Spectro-Temporal Processing, Deep Learning
Disciplines: Nanotechnology, Design theories and methods
Project type: PhD project