Project

Spatio-Temporal Speech Enhancement in Adverse Acoustic Conditions

Never before has speech been captured as often by electronic devices equipped with one or multiple microphones, serving a variety of applications. Speech capture is central to digital telephony, hearing devices, and voice-driven human-to-machine interaction.

When speech is recorded, the microphones also capture a variety of further, undesired sound components due to adverse acoustic conditions. Interfering speech, background noise and reverberation, i.e. the persistence of sound in a room after excitation caused by a multitude of reflections on the room enclosure, are detrimental to the quality and intelligibility of target speech as well as the performance of automatic speech recognition. Hence, speech enhancement aiming at estimating the early target-speech component, which contains the direct component and early reflections, is crucial to nearly all speech-related applications presently available.

In this thesis, we compare, propose and evaluate existing and novel approaches in the field of speech enhancement. In doing so, we take into account the following technical aspects, which guide the design of the proposed approaches. First, we envisage comprehensive speech enhancement in all varieties of adverse acoustic conditions, which requires dereverberation, interfering speech cancellation and noise reduction. Second, we aim to exploit spatial and temporal knowledge of the target-speech direction and statistics in the form of relative early transfer function (RETF) and time-varying early power-spectral-density (PSD) estimates, and further to acquire such knowledge in a multi-source scenario. Third, we strive for online processing in dynamic acoustic scenarios, and fourth, for moderate computational complexity.

The thesis opens with a problem description and a thorough overview of the state of the art. Major parts of the remainder relate to two broad concepts in multi-microphone speech enhancement, namely beamforming and blind deconvolution, specifically by means of the generalized sidelobe canceler (GSC) and multi-channel linear prediction (MCLP), respectively. While beamforming is a well-established approach to interfering speech cancellation and noise reduction, MCLP is arguably the most popular approach to dereverberation at present.

As a preparatory step towards comprehensive speech enhancement, we analyze and compare the GSC and MCLP architectures in terms of their potential for dereverberation and noise reduction. They mainly differ in their data-dependent filter paths, i.e. the sidelobe cancellation (SC) and the linear prediction (LP) filter paths, which entail spatial and temporal pre-processing by means of a blocking matrix (BM) and a delay, respectively. We show that in the case of perfect spatial knowledge, the GSC reaches the same dereverberation performance as MCLP while additionally performing noise reduction, which MCLP does not. In the case of deficient spatial knowledge, however, the GSC performs worse than MCLP in terms of dereverberation.
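
The structural difference between the two filter paths can be sketched per frequency bin in generic textbook notation. All symbols here are assumptions of this sketch and not necessarily those of the thesis: y(t) is the stacked microphone signal at frame t, w_q a fixed beamformer, B the blocking matrix, h the RETF, y_ref a reference microphone signal, and tau the prediction delay.

```latex
\begin{align}
  % GSC: fixed beamformer output minus the SC path, which only sees
  % signals that pass the blocking matrix (spatial pre-processing)
  e_{\mathrm{GSC}}(t)  &= \mathbf{w}_{\mathrm{q}}^{H}\,\mathbf{y}(t)
                          - \mathbf{w}_{\mathrm{SC}}^{H}\,\mathbf{B}^{H}\mathbf{y}(t),
                          \qquad \mathbf{B}^{H}\mathbf{h} = \mathbf{0}, \\
  % MCLP: reference signal minus the LP path, which only sees
  % delayed past frames (temporal pre-processing)
  e_{\mathrm{MCLP}}(t) &= y_{\mathrm{ref}}(t)
                          - \mathbf{w}_{\mathrm{LP}}^{H}\,\tilde{\mathbf{y}}(t-\tau).
\end{align}
```

In both cases the pre-processing is meant to keep the early target-speech component out of the data-dependent path: the blocking matrix removes it spatially, provided the RETF is known accurately, whereas the delay removes it temporally.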

Based on this comparison and the recently common usage of MCLP-and-beamforming cascades, we propose to integrate the GSC and MCLP into a novel architecture referred to as integrated sidelobe cancellation and linear prediction (ISCLP), where the SC filter and the LP filter operate in parallel. We propose to estimate the SC and LP filters jointly and online by means of a single Kalman filter. We further propose a spectral Wiener gain post-processor, relating to the Kalman filter’s posterior state estimate. While being computationally less demanding than two state-of-the-art approaches, the ISCLP Kalman filter is shown to perform similarly or better in various adverse acoustic conditions.
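
To illustrate what estimating two filters jointly with a single Kalman filter can look like, the following is a minimal sketch in plain Python, not the thesis implementation. It assumes a generic observation model q = u^H w + e in one frequency bin, where the state w hypothetically stacks the SC and LP filter coefficients, the regressor u stacks the corresponding filter inputs, and the state follows a random walk. The names `kalman_update`, `demo`, `phi_e` and `phi_w` are illustrative, and the demo uses synthetic data with a fixed error variance, whereas for speech this variance would be time-varying.

```python
import random


def kalman_update(w, P, u, q, phi_e, phi_w=1e-6):
    """One Kalman update for the assumed observation model q = u^H w + e.

    w     : stacked filter state, hypothetically [SC filter; LP filter]
    P     : state estimation error correlation matrix (list of lists)
    u     : stacked regressor, hypothetically [BM outputs; delayed mic signals]
    q     : scalar reference observation
    phi_e : variance assigned to the error signal e
    phi_w : random-walk process noise added to the diagonal of P (assumption)
    """
    n = len(w)
    # Time update: the random-walk model only inflates the diagonal of P.
    P = [[P[i][j] + (phi_w if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    # Prior error signal e = q - u^H w.
    e = q - sum(u[i].conjugate() * w[i] for i in range(n))
    # Kalman gain k = P u / (u^H P u + phi_e); this step is quadratic in n.
    Pu = [sum(P[i][j] * u[j] for j in range(n)) for i in range(n)]
    s = sum(u[i].conjugate() * Pu[i] for i in range(n)).real + phi_e
    k = [Pu[i] / s for i in range(n)]
    # Measurement update of the state and of the error correlation matrix.
    w = [w[i] + k[i] * e for i in range(n)]
    P = [[P[i][j] - k[i] * Pu[j].conjugate() for j in range(n)]
         for i in range(n)]
    return w, P, e


def demo(num_frames=400, n_sc=2, n_lp=3, seed=0):
    """Identify a synthetic stacked filter; return the largest coefficient error."""
    rng = random.Random(seed)

    def cgauss():
        return complex(rng.gauss(0, 1), rng.gauss(0, 1))

    n = n_sc + n_lp
    w_true = [0.3 * cgauss() for _ in range(n)]
    w = [0j] * n
    P = [[(1.0 if i == j else 0.0) + 0j for j in range(n)] for i in range(n)]
    for _ in range(num_frames):
        u = [cgauss() for _ in range(n)]
        q = sum(u[i].conjugate() * w_true[i] for i in range(n)) + 0.01 * cgauss()
        w, P, _ = kalman_update(w, P, u, q, phi_e=2e-4)
    return max(abs(a - b) for a, b in zip(w, w_true))
```

Because both filters share one state vector and one error correlation matrix, the cross-correlations between SC and LP coefficients are tracked as well. This comes at a cost that is quadratic in the total number of coefficients per update.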

The ISCLP Kalman filter exploits spatial and temporal target-parameter knowledge to be acquired in a multi-source scenario. To this end, we propose an appropriate online estimation approach, namely square-root-based multi-source early PSD estimation and RETF updating. Here, as opposed to the conventional approach, we propose to factorize the early correlation matrix and minimize the approximation error defined with respect to the early-correlation-matrix square root. From the proposed minimization problem, we iteratively obtain estimates of a unitary matrix and the early PSD square roots, which in turn allow the RETF estimate to be updated recursively. Evaluation indicates better performance than the conventional approach and convergence in only one iteration.

The ISCLP Kalman filter exhibits a quadratic computational complexity in the number of filter coefficients and the number of channels. We therefore propose low-complexity variants of the ISCLP Kalman filter. The low-complexity variants are obtained by constraining the state estimation error correlation matrix to assume sparse structures corresponding to the neglect of either temporal, spatial, or all cross-correlations, leading to linear cost in either or both the number of filter coefficients and the number of channels. The low-complexity ISCLP Kalman filter variants are shown to perform nearly as well as the original variant, thereby permitting far more favourable trade-offs between complexity and performance.
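
As a rough illustration of how a sparse error correlation matrix reduces cost, the sketch below covers only the simplest case, in which all cross-correlations are neglected so that the matrix becomes diagonal. It assumes a generic observation model q = u^H w + e with a random-walk state and is not the thesis implementation; the names `diagonal_kalman_update`, `demo_diagonal`, `phi_e` and `phi_w` are hypothetical.

```python
import random


def diagonal_kalman_update(w, p, u, q, phi_e, phi_w=1e-6):
    """Kalman update with the error correlation matrix constrained to be diagonal.

    p holds only the diagonal entries, so every step costs O(n), not O(n^2).
    """
    n = len(w)
    # Time update under a random-walk state model (assumption).
    p = [pi + phi_w for pi in p]
    # Prior error signal e = q - u^H w.
    e = q - sum(u[i].conjugate() * w[i] for i in range(n))
    # Innovation variance s = sum_i p_i |u_i|^2 + phi_e.
    s = sum(p[i] * abs(u[i]) ** 2 for i in range(n)) + phi_e
    # State update with the per-coefficient gain p_i u_i / s.
    w = [w[i] + p[i] * u[i] * e / s for i in range(n)]
    # Keep only the diagonal of the posterior error correlation matrix.
    p = [p[i] - (p[i] * abs(u[i])) ** 2 / s for i in range(n)]
    return w, p, e


def demo_diagonal(num_frames=3000, n=5, seed=0):
    """Identify a synthetic filter; return the largest coefficient error."""
    rng = random.Random(seed)

    def cgauss():
        return complex(rng.gauss(0, 1), rng.gauss(0, 1))

    w_true = [0.3 * cgauss() for _ in range(n)]
    w = [0j] * n
    p = [1.0] * n
    for _ in range(num_frames):
        u = [cgauss() for _ in range(n)]
        q = sum(u[i].conjugate() * w_true[i] for i in range(n)) + 0.01 * cgauss()
        w, p, _ = diagonal_kalman_update(w, p, u, q, phi_e=2e-4)
    return max(abs(a - b) for a, b in zip(w, w_true))
```

With a diagonal matrix, the update resembles a normalized LMS algorithm with per-coefficient step sizes. Variants that retain only temporal or only spatial cross-correlations would presumably keep a block structure instead, trading some of the savings for more of the full filter's behaviour.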

The thesis concludes with a summary, suggestions for future research and a discussion of industrial valorization.

Date: 29 Jan 2015 → 8 Oct 2019
Keywords: Beamforming, Dereverberation, Speech Enhancement, Noise Reduction, Multi-Channel Linear Prediction
Disciplines: Applied mathematics in specific fields
Project type: PhD project