Dereverberation of Speech Using Autoregressive Models of Sub-band Envelopes
Abstract
Automatic speech recognition (ASR) technologies are radically changing the way we interact with digital services and information. Most of these applications rely on hands-free speech, where talkers are able to speak at a distance from the microphones without the nuisance of a handheld or body-worn device. Applications such as meeting annotation, speech-to-text transcription in teleconferencing, and hands-free interfaces for consumer products (interactive TVs, virtual assistants in mobile phones, smart speakers, etc.) will benefit from a distant-talking mode of operation. The main issue in distant-talking speech recognition is the corruption of speech signals by noise and reverberation. This thesis focuses on developing dereverberation methods for speech processing using sub-band temporal envelopes.
This thesis pursues two broad directions for addressing issues in far-field ASR. In the first part of the thesis, two methods for dereverberation are proposed. In the second part of the thesis, we develop a speech enhancement model, where the audio signal is re-synthesized using dereverberated temporal envelopes and corresponding carrier components.
The first of the two dereverberation methods develops a 3-D acoustic modeling framework for far-field ASR, where spatio-spectral features from all the available channels are extracted. The features that are input to the 3-D convolutional neural network (CNN) are extracted by modeling the signal peaks in the spatio-spectral domain using a multivariate autoregressive (AR) modeling approach. This AR model is efficient in capturing the channel correlations in the frequency domain of the multi-channel signal. In the second method, a neural model for speech dereverberation using the long-term sub-band envelopes of speech is developed. The neural dereverberation model estimates the envelope gain, which, when applied to the reverberant signal, suppresses the late reflection components in the far-field signal. The dereverberated envelopes are used for feature extraction in speech recognition.
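As a minimal sketch of what applying an envelope gain could look like, the snippet below assumes the gain is a per-sample multiplicative factor handled in the log domain. That form is an assumption made for illustration, not a detail stated in this abstract. The function names `oracle_log_gain` and `apply_envelope_gain` are hypothetical, and the oracle gain computed from a clean reference stands in for what a neural model would have to predict from the reverberant envelope alone.

```python
import numpy as np


def oracle_log_gain(clean_env, reverb_env, eps=1e-8):
    """Log-domain gain that maps a reverberant envelope onto a clean one.

    Hypothetical helper: in practice the clean envelope is unknown and
    a trained model would estimate this gain from the reverberant input.
    """
    return np.log(clean_env + eps) - np.log(reverb_env + eps)


def apply_envelope_gain(reverb_env, log_gain):
    """Apply a log-domain gain as a multiplicative factor (assumed form)."""
    return reverb_env * np.exp(log_gain)


# Toy example: smear a clean envelope with an exponentially decaying tail,
# a crude stand-in for late reflections, then undo it with the oracle gain.
t = np.arange(200)
clean_env = 1.0 + np.exp(-0.5 * ((t - 60) / 8.0) ** 2)
tail = np.exp(-t[:50] / 15.0)
reverb_env = np.convolve(clean_env, tail)[: len(t)]

gain = oracle_log_gain(clean_env, reverb_env)
derev_env = apply_envelope_gain(reverb_env, gain)
```

In this toy setting the gain is negative wherever the tail adds energy to the envelope, so applying it attenuates the smeared regions.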
The second part of the thesis deals with envelope-carrier based speech enhancement. Here, we investigate the effect of far-field artifacts on the temporal envelopes and the corresponding carrier components. A dual-path recurrent neural model is used to learn, in parallel, the mappings for the reverberant envelopes and the carrier signals. Further, joint learning of the speech enhancement model with the end-to-end ASR model as a single neural model is proposed.
Both parts of the thesis use the frequency-domain linear prediction (FDLP) model for extracting the envelopes of the sub-band signals in long analysis windows. We report several ASR and speech quality experiments that highlight the benefits of the proposed techniques.
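For readers unfamiliar with FDLP, the following is a minimal single-band sketch of the general idea: linear prediction is applied to the discrete cosine transform (DCT) of a long signal window, and the resulting all-pole "spectrum" is read as a smooth temporal envelope. The function name `fdlp_envelope`, the model order, and the small regularization constant are assumptions chosen for illustration. The sketch omits the sub-band windowing of the DCT coefficients and does not reproduce the exact configuration used in the thesis.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz


def fdlp_envelope(x, order=20, n_points=None):
    """Estimate a smooth temporal envelope of x by linear prediction on its DCT.

    Hypothetical helper for illustration. Returns n_points samples of the
    all-pole envelope, spanning the duration of the input window.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    n_points = n_points or n

    # Frequency-domain representation of the whole analysis window.
    c = dct(x, type=2, norm="ortho")

    # Autocorrelation of the DCT sequence for lags 0..order.
    r = np.array([np.dot(c[: n - k], c[k:]) for k in range(order + 1)])
    r[0] *= 1.0 + 1e-6  # slight diagonal loading for numerical stability

    # Solve the Yule-Walker (normal) equations for the predictor coefficients.
    a = solve_toeplitz(r[:order], r[1 : order + 1])
    err = r[0] - np.dot(a, r[1 : order + 1])  # prediction error power

    # Evaluate 1 / |A|^2 on [0, pi); by duality this axis corresponds to time.
    poly = np.concatenate(([1.0], -a))
    A = np.fft.rfft(poly, 2 * n_points)[:n_points]
    return err / np.abs(A) ** 2
```

Because the AR model is fit in the DCT domain, its poles track peaks of the signal energy in time rather than in frequency. A higher `order` yields a more detailed envelope.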