Speech and noise analysis using sparse representation and acoustic-phonetics knowledge
Abstract
This thesis addresses different aspects of machine listening using two approaches: (1) a
supervised and adaptive sparse representation based approach for identifying the type of background
noise and the speaker, and for separating the speech and background noise, and (2) an unsupervised
acoustic-phonetics knowledge based approach for detecting transitions between broad phonetic classes
in a speech signal and significant excitation instants, called glottal closure instants (GCIs), in voiced
speech, for applications such as speech segmentation, recognition and modification.
Real-life speech signals generally contain foreground speech from a particular speaker in the presence
of a background environment such as factory or traffic noise. These audio signals, termed noisy
speech signals, are available either as recordings, such as audio intercepts, or as real-time signals, and
may be single-channel or multi-channel. Real-time signals arise during mobile communication
and in hearing aids. The research community has addressed the processing of these signals for
various independent applications such as classification of the components of the noisy speech signal, source
separation, enhancement, speech recognition, audio coding, duration modification and speaker normalization.
Machine listening encapsulates solutions to these applications in a single system. It extracts
useful information from noisy speech signals, and attempts to understand the content as much as
humans do. In the case of speech enhancement, especially for the hearing impaired, the suppression
of background noise for improving the intelligibility of speech would be more effective if the type
of background noise is classified first. Other interesting applications of noise identification are
forensics, machinery noise diagnostics, robotic navigation systems and acoustic signature classification
of aircraft or vehicles. Another motivation to identify the nature of the background noise is to narrow
down the possible geographical location of a speaker. Speaker classification helps to identify the
speaker in an audio intercept.
In the supervised sparse representation based approach, a dictionary learning based noise and
speaker classification algorithm is proposed, which uses a cosine similarity measure for learning the
dictionary atoms, and is compared with other non-negative dictionary learning methods. For training,
we learn dictionaries for the speaker and noise sources separately using the various dictionary learning
methods. In the testing phase, we use the Active Set Newton Algorithm (ASNA) and supervised
non-negative matrix factorization for source recovery. Based on the objective measure of signal
to distortion ratio (SDR), we obtain a frame-wise noise classification accuracy of 97.8% for fifteen
different noises taken from the NOISEX database. The proposed evaluation metric of sum of weights
(SW), applied to concatenated dictionaries, gives good accuracy for speaker classification on clean
speech, using high-energy subsets of test frames and dictionary atoms. We obtain the best utterance-level
speaker classification accuracy of 100% for 30 speakers taken from the TIMIT database on clean speech.
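As a rough illustration of the sum-of-weights idea (and not the implementation used in the thesis), the Python sketch below accumulates, over all test frames, the weights that the atoms of each class dictionary receive and returns the class with the largest total. Non-negative least squares stands in for ASNA, the features are assumed to be magnitude spectra, and the cosine-similarity dictionary learning and the energy-based selection of frames and atoms are omitted; all function and variable names are illustrative.

    import numpy as np
    from scipy.optimize import nnls

    def classify_by_sum_of_weights(test_frames, class_dicts):
        """Pick a class label by accumulating sparse weights over class dictionaries.

        test_frames : (n_bins, n_frames) non-negative magnitude spectra of the test signal.
        class_dicts : dict mapping a class label to its (n_bins, n_atoms) dictionary.
        """
        labels = list(class_dicts)
        # Concatenate the per-class dictionaries column-wise.
        D = np.hstack([class_dicts[l] for l in labels])
        sizes = [class_dicts[l].shape[1] for l in labels]
        bounds = np.cumsum([0] + sizes)          # column ranges belonging to each class
        totals = np.zeros(len(labels))
        for x in test_frames.T:
            w, _ = nnls(D, x)                    # non-negative weights for this frame
            for i in range(len(labels)):
                totals[i] += w[bounds[i]:bounds[i + 1]].sum()
        return labels[int(np.argmax(totals))]    # class whose atoms gather the most weight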
We have then dealt with noisy speech signals assuming a single speaker speaking in a noisy environment.
The noisy speech signals have been simulated at different SNRs using different noise and
speaker sources. We have classified the speaker and background noise class of the noisy speech signal
and subsequently separated the speech and noise components. Given a test noisy speech signal, a noise
label is assigned to a subset of frames selected using the SDR measure, and an accumulated measure is
used to classify the noise in the whole test signal. The speaker is classified using the proposed metric
of accumulated sum of weights on high-energy features, estimated using ASNA with L1 regularization
from the concatenation of the speaker dictionaries and the identified noise source dictionary. Using the
dictionaries of the identified speaker and noise source, we estimate the separated speech
and noise signals using ASNA with L1 regularization and supervised non-negative matrix factorization
(NMF). We obtain around 98% accuracy for noise classification and 89% for speaker classification at
an SNR of 10 dB for a combination of 30 speakers and 15 noise sources.
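The separation step with the identified dictionaries can be sketched along the following lines, again with non-negative least squares standing in for ASNA and supervised NMF, and with a Wiener-like soft mask assumed for reconstruction; the mixture phase would be reused to resynthesize the time-domain signals, and the names below are illustrative rather than those used in the thesis.

    import numpy as np
    from scipy.optimize import nnls

    def separate_speech_noise(mix_mag, D_speech, D_noise, eps=1e-12):
        """Separate a noisy-speech magnitude spectrogram using fixed class dictionaries.

        mix_mag  : (n_bins, n_frames) magnitude spectrogram of the noisy speech.
        D_speech : (n_bins, n_s) dictionary of the identified speaker.
        D_noise  : (n_bins, n_n) dictionary of the identified noise source.
        """
        D = np.hstack([D_speech, D_noise])
        n_s = D_speech.shape[1]
        speech_mag = np.zeros_like(mix_mag)
        noise_mag = np.zeros_like(mix_mag)
        for t, x in enumerate(mix_mag.T):
            w, _ = nnls(D, x)                        # weights over both dictionaries
            s_hat = D_speech @ w[:n_s]               # speech part of the model
            n_hat = D_noise @ w[n_s:]                # noise part of the model
            mask = s_hat / (s_hat + n_hat + eps)     # Wiener-like soft mask
            speech_mag[:, t] = mask * x
            noise_mag[:, t] = (1.0 - mask) * x
        return speech_mag, noise_mag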
In the case of an unknown noise, the noise source is estimated as the nearest known noise label.
The distribution of an unknown noise source amongst the known noise classes gives an indication
of the possible noise source. The dictionary corresponding to the estimated noise label is updated
adaptively using the features from the noise-only frames of the test signal. The updated dictionary
is then used for speaker classification, and subsequently separation is carried out. In the case of an
unknown speaker, the nearest speaker is estimated and the corresponding dictionary is updated using
a clean speech segment from the test signal, which we assume is available for adapting the speech
dictionary. We have observed an improvement in SDR after separation of the speech and noise
components using an adaptive dictionary. The adaptive noise dictionary gives an improvement of
about 18% in speaker classification accuracy and 4 dB in SDR over an out-of-set dictionary, after
enhancement of noisy speech at an SNR of 0 dB.
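The exact adaptation rule is specific to the thesis; as a loose sketch of the idea, the snippet below appends a few unit-norm atoms, drawn from the highest-energy noise-only frames of the test signal, to the dictionary of the nearest known noise class. The frame selection, the number of new atoms and the normalization are assumptions.

    import numpy as np

    def adapt_noise_dictionary(D_noise, noise_only_frames, n_new_atoms=8):
        """Append test-signal-specific atoms to the nearest known noise dictionary.

        D_noise           : (n_bins, n_atoms) dictionary of the estimated noise label.
        noise_only_frames : (n_bins, n_frames) magnitude spectra of noise-only frames.
        """
        energies = noise_only_frames.sum(axis=0)
        idx = np.argsort(energies)[-n_new_atoms:]               # strongest noise-only frames
        new_atoms = noise_only_frames[:, idx]
        new_atoms = new_atoms / (np.linalg.norm(new_atoms, axis=0, keepdims=True) + 1e-12)
        return np.hstack([D_noise, new_atoms])                  # adapted dictionary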
In the case of a conversation, a divide-and-conquer algorithm is proposed to recursively estimate
the noise sources, the approximate instant of noise transition and the number of noise types. We
have then experimented on a conversation simulated by concatenating two different noise signals,
each containing speech segments of distinct speakers, and obtained a mean absolute error of 10 ms
in the detection of the noise transition instant at -10 dB SNR. Each of the segments obtained from
the transition instant can be treated as a single noise mixed with speech from a single speaker, and
subsequent speaker classification and source separation can be done as in the previous case.
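The recursion itself can be pictured roughly as below, assuming a frame-level noise classifier is available; the splitting criterion, the minimum segment length and the merging of like-labelled neighbours are illustrative choices rather than the thesis algorithm.

    def detect_noise_segments(frames, classify_noise, min_len=20):
        """Recursively split a recording and classify the background noise in each part.

        frames         : (n_bins, n_frames) feature matrix of the noisy conversation.
        classify_noise : callable mapping a feature sub-matrix to a noise label.
        min_len        : smallest segment length (in frames) that is still split.
        Returns a list of (start, end, label) segments; the boundaries between
        segments with different labels are the estimated noise-transition instants.
        """
        def recurse(lo, hi):
            label = classify_noise(frames[:, lo:hi])
            if hi - lo <= min_len:
                return [(lo, hi, label)]
            mid = (lo + hi) // 2
            if classify_noise(frames[:, lo:mid]) == classify_noise(frames[:, mid:hi]) == label:
                return [(lo, hi, label)]           # homogeneous segment: stop splitting
            return recurse(lo, mid) + recurse(mid, hi)

        merged = []
        for lo, hi, label in recurse(0, frames.shape[1]):
            if merged and merged[-1][2] == label:  # merge neighbours with the same label
                merged[-1] = (merged[-1][0], hi, label)
            else:
                merged.append((lo, hi, label))
        return merged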
We have also addressed the classification and subsequent separation of speakers in overlapped
speech, obtaining a mean speaker classification accuracy of 84% for a speaker 1 to speaker 2 ratio
(S1S2R) of 0 dB.
The advantage of the proposed dictionary learning and sparse representation based approach is
that the training and classification model is independent of the selected classes of speakers and noises.
Dictionaries for new classes can easily be added, and old classes can be removed or replaced, without
retraining the entire model. Also, the same model can be used for identifying other types of classes,
such as language and gender. We have achieved speaker and noise classification and subsequent
separation using only spectral features for dictionary learning. This is in contrast to stochastic
model based approaches, where the model needs to be retrained whenever a new class is added.
In the unsupervised acoustic-phonetics knowledge based approach, we detect transitions between
broad phonetic classes in a speech signal, which has applications such as landmark detection and segmentation.
The proposed rule-based hierarchical method detects transitions from silence to non-silence,
and from sonorant to non-sonorant and vice versa. We exploit the relatively abrupt changes in the characteristics
of the speech signal to detect the transitions. Relative thresholds learnt from a small development set
are used to determine the parameter values. We propose different measures for detecting transitions
between broad phonetic classes in a speech signal based on abrupt amplitude changes. A measure
is defined on the quantized speech signal to detect transitions between very low amplitude or silence
(S) and non-silence (N) segments. The S-segments could be stop closures, pauses or silence regions
at the beginning and/or ending of an utterance. We propose two other measures to detect the transitions
between sonorant and non-sonorant segments and vice versa. These measures make use of the fact that
most sonorants have higher energy in the low frequencies than other phone classes such as unvoiced
fricatives, affricates and unvoiced stops. For this reason, we use a bandpass filtered speech signal (60-340 Hz)
for extracting temporal features. A subset of the extrema (minimum or maximum amplitude samples)
between every pair of successive zero-crossings and above a threshold is selected from each frame of
the bandpass filtered speech signal. If the speech signal belongs to a non-transition segment, the first
and the last of these extrema lie well before and well after the mid-point (reference) of the frame;
otherwise, one of these locations lies within a few samples of the reference, indicating a transition frame.
The advantage of this approach is that it does not require significant training data for determining
its parameters.
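A simplified sketch of this transition test, under assumed values for the amplitude threshold and the "few samples" guard around the frame mid-point, is given below; the thesis uses relative thresholds learnt from the development set and additional measures not shown here.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def bandpass_60_340(x, fs):
        """60-340 Hz bandpass filter used for extracting the temporal features."""
        b, a = butter(4, [60.0 / (fs / 2), 340.0 / (fs / 2)], btype="band")
        return filtfilt(b, a, x)

    def is_transition_frame(frame, threshold, guard=0.1):
        """Flag a frame of the bandpass filtered speech as a transition frame.

        One extremum is taken between every pair of successive zero-crossings, and
        only extrema whose magnitude exceeds `threshold` are kept.  If the first or
        the last kept extremum falls close to the frame mid-point (within `guard`
        times the frame length), the frame is flagged as a transition frame.
        """
        n = len(frame)
        signs = np.signbit(frame).astype(np.int8)
        zc = np.where(np.diff(signs) != 0)[0]                  # zero-crossing positions
        kept = []
        for a, b in zip(zc[:-1], zc[1:]):                      # one extremum per half-cycle
            seg = frame[a + 1:b + 1]
            if seg.size == 0:
                continue
            k = a + 1 + int(np.argmax(np.abs(seg)))
            if abs(frame[k]) > threshold:
                kept.append(k)
        if len(kept) < 2:
            return False                                       # too few strong extrema to decide
        mid = n // 2
        return min(abs(kept[0] - mid), abs(kept[-1] - mid)) < guard * n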
When tested on the entire TIMIT database for clean speech, 93.6% of the detected transitions
are within a tolerance of 20 ms of the hand-labeled boundaries. Sonorant, unvoiced non-sonorant
and silence classes and their respective onsets are detected with an accuracy of about 83.5% for the
same tolerance, using the labeled TIMIT database as reference. The results are as good as, and in
some respects better than, those of state-of-the-art methods for similar tasks. The proposed method is
also tested on the test set of the TIMIT database for robustness with respect to white, babble and
Schroeder noise, and about 90% of the transitions are detected within the tolerance of 20 ms at a
signal-to-noise ratio of 5 dB.
We have also estimated glottal closure instants (GCIs), which are useful for a variety of applications
such as pitch and duration modification, speaking rate modification, pitch normalization, speech
coding/compression, and speaker normalization. The instant at which the vocal tract is significantly excited
within each glottal cycle in a speech signal is referred to as the epoch or the GCI. Subband analysis
of the linear prediction residual (LPR) is proposed to estimate the GCIs from voiced speech segments. A
composite signal is derived as the sum of the envelopes of the subband components of the LPR signal.
Appropriately chosen peaks of the composite signal are the GCI candidates. The temporal locations
of the candidates are refined using the LPR to obtain the GCIs, which are validated against the GCIs
obtained from the simultaneously recorded electroglottograph signal. The robustness is studied using
additive white, pink, blue, babble, vehicle and HF channel noises at different signal-to-noise ratios,
and under reverberation. The proposed method is evaluated on six different databases and compared with
three state-of-the-art LPR-based methods. The GCI detection performance of the proposed algorithm
is quantified using the following measures: identification rate (IDR), miss rate (MR), false alarm rate
(FAR), standard deviation of error (SDE) and accuracy to 0.25 ms. We have shown that significant
GCI information exists in each subband of speech up to 2000 Hz, and an identification rate of at
least 89% (for subbands other than the lowpass band) can be obtained for clean speech using the
proposed method. The results show that the performance of the proposed method is comparable to
the best of the LPR-based techniques for both clean and noisy speech.
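A rough sketch of the subband-envelope idea is given below; it applies LP analysis over a whole voiced segment rather than frame-wise, assumes four subbands up to 2000 Hz, and uses simple peak picking in place of the candidate selection and LPR-based refinement of the thesis, so the band edges, filter orders and peak-picking parameters are all assumptions.

    import numpy as np
    from scipy.signal import butter, filtfilt, find_peaks, hilbert, lfilter

    def lpc_inverse_filter(x, order=12):
        """Autocorrelation-method LP inverse filter [1, -a1, ..., -ap] for a segment."""
        x = np.asarray(x, dtype=float)
        r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        a = np.linalg.solve(R + 1e-6 * np.eye(order), r[1:order + 1])
        return np.concatenate(([1.0], -a))

    def gci_candidates(voiced_segment, fs, n_bands=4, fmax=2000.0):
        """Return sample indices of GCI candidates from a voiced speech segment.

        The LP residual is split into a few subbands up to `fmax`, the Hilbert
        envelope of each subband is computed, and the envelopes are summed into
        a composite signal whose prominent peaks are taken as GCI candidates.
        """
        inv = lpc_inverse_filter(voiced_segment)
        residual = lfilter(inv, [1.0], voiced_segment)        # linear prediction residual (LPR)
        edges = np.linspace(50.0, fmax, n_bands + 1)          # assumed subband edges in Hz
        composite = np.zeros_like(residual)
        for lo, hi in zip(edges[:-1], edges[1:]):
            b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
            composite += np.abs(hilbert(filtfilt(b, a, residual)))  # subband envelope
        min_dist = int(0.002 * fs)                            # assume pitch period above 2 ms
        peaks, _ = find_peaks(composite, distance=min_dist, height=0.3 * composite.max())
        return peaks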