Spectro-Temporal Features For Robust Automatic Speech Recognition
Abstract
The speech signal is inherently characterized by its variations in time, which get reflected as variations in frequency. The specto temporal changes are due to changes in vocaltract, intonation, co-articulation and successive articulation of different phonetic sounds. In this thesis we are looking for improving the speech recognition performance through better feature parameters using a non-stationary model of speech. One effective means of modeling a general non-stationary signal is using the AM-FM model. AM-FM model can be extended to speech through a sub-band analysis, which can be mimic the auditory analysis. In this thesis, we explore new methods for estimating AM and FM parameters based on the non-uniform samples of the signal. The non-uniform sample approach along with adaptive window estimation provides for important advantage because of multi-resolution analysis. We develop several new methods based on ZC intervals, local extrema intervals and signal derivative at ZC’s as different sample measures of the signal and explore their effectiveness for instantaneous frequency (IF) and instantaneous envelope (IE) estimation. To deal with speech signal for automatic speech recognition, we explore the use of auditory motivated spectro temporal information through the use of an auditory filter bank and signal parameters (or features) are derived from the instantaneous energy in each band using the non-linear energy operator over a larger window length. The temporal correlation present in the signal is exploited by using DCT and keeping the lower few coefficients of DCT to keep the trend in the energy in each band. The DCT coefficients from different frequency bands are concatenated together, and a further spectral decorrelation is achieved through KLT (Karhunen-Loeve Transform) of the concatenated feature vector. The changes in the vocaltract are well captured by the change in the formant structure and to emphasize these details for ASR we have defined a temporal formant by using the AM-FM decomposition of sub-band speech. A uniform wideband non-overlaping filters are used for sub-band decomposition. The temporal formant is defined using the AM-FM parameters of each subband signal. The temporal evolution of a formant is represented by the lower order DCT coefficients of the temporal formant in each band and its use for ASR is explored. To address the robustness of ASR performance to environmental noisy conditions, we have used a hybrid approach of enhancing the speech signal using statistical models of the speech and noise. Use of GMM for statistical speech enhancement has been shown to be effective. It is found that the spectro-temporal features derived from enhanced speech provide further improvement to ASR performance.
Collections
Related items
Showing items related by title, author, creator and subject.
-
Demodulation of Narrowband Speech Spectrograms
Aragonda, Haricharan (2017-11-22)Speech is a non-stationary signal and contains modulations in both spectral and temporal domains. Based on the type of modulations studied, most speech processing algorithms can be classified into short-time analysis ... -
Joint Evaluation Of Multiple Speech Patterns For Speech Recognition And Training
Nair, Nishanth Ulhas (2009-12-09)Improving speech recognition performance in the presence of noise and interference continues to be a challenging problem. Automatic Speech Recognition (ASR) systems work well when the test and training conditions match. ... -
Characterization of the Voice Source by the DCT for Speaker Information
Abhiram, B (2017-12-10)Extracting speaker-specific information from speech is of great interest to both researchers and developers alike, since speaker recognition technology finds application in a wide range of areas, primary among them being ...