Speech and noise analysis using sparse representation and acoustic-phonetics knowledge
Abstract
This thesis addresses different aspects of machine listening using two approaches: (1) a
supervised and adaptive sparse representation based approach for identifying the type of background
noise and the speaker, and for separating the speech and background noise, and (2) an unsupervised
acoustic-phonetics knowledge based approach for detecting transitions between broad phonetic classes
in a speech signal and significant excitation instants, called glottal closure instants (GCIs), in voiced
speech, for applications such as speech segmentation, recognition and modification.
Real-life speech signals generally contain foreground speech from a particular speaker in the presence
of a background environment such as factory or traffic noise. These audio signals, termed noisy
speech signals, are available either as recordings, such as audio intercepts, or as real-time signals, and
may be single-channel or multi-channel. Real-time signals arise during mobile communication
and in hearing aids. The research community has addressed the processing of these signals for
various independent applications such as classification of the components of the noisy speech signal, source
separation, enhancement, speech recognition, audio coding, duration modification and speaker normalization.
Machine listening encapsulates solutions to these applications in a single system. It extracts
useful information from noisy speech signals, and attempts to understand the content as much as
humans do. In the case of speech enhancement, especially for the hearing impaired, the suppression
of background noise for improving the intelligibility of speech would be more effective if the type
of background noise is classified first. Other interesting applications of noise identification are
forensics, machinery noise diagnostics, robotic navigation systems and acoustic signature classification
of aircraft or vehicles. Another motivation to identify the nature of the background noise is to narrow
down the possible geographical location of a speaker. Speaker classification helps to identify the
speaker in an audio intercept.
In the supervised sparse representation based approach, a dictionary learning based noise and
speaker classification algorithm is proposed, which uses a cosine similarity measure for learning the
dictionary atoms, and is compared with other non-negative dictionary learning methods. For training,
we learn dictionaries for the speaker and noise sources separately using the various dictionary learning
methods. In the testing phase, we use the Active Set Newton Algorithm (ASNA) and supervised
non-negative matrix factorization for source recovery. Based on the objective measure of signal
to distortion ratio (SDR), we obtain a frame-wise noise classification accuracy of 97.8% for fifteen
different noises taken from the NOISEX database. The proposed evaluation metric of sum of weights
(SW), applied to concatenated dictionaries, gives good accuracy for speaker classification on clean
speech, using high-energy subsets of test frames and dictionary atoms. We obtain the best utterance-level
speaker classification accuracy of 100% for 30 speakers taken from the TIMIT database on clean speech.
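As a rough illustration of the sum-of-weights idea (and not the implementation used in the thesis), the Python sketch below accumulates, over all test frames, the weights that the atoms of each class dictionary receive and returns the class with the largest total. Non-negative least squares stands in for ASNA, the features are assumed to be magnitude spectra, and the cosine-similarity dictionary learning and the energy-based selection of frames and atoms are omitted; all function and variable names are illustrative.

    import numpy as np
    from scipy.optimize import nnls

    def classify_by_sum_of_weights(test_frames, class_dicts):
        """Pick a class label by accumulating sparse weights over class dictionaries.

        test_frames : (n_bins, n_frames) non-negative magnitude spectra of the test signal.
        class_dicts : dict mapping a class label to its (n_bins, n_atoms) dictionary.
        """
        labels = list(class_dicts)
        # Concatenate the per-class dictionaries column-wise.
        D = np.hstack([class_dicts[l] for l in labels])
        sizes = [class_dicts[l].shape[1] for l in labels]
        bounds = np.cumsum([0] + sizes)          # column ranges belonging to each class
        totals = np.zeros(len(labels))
        for x in test_frames.T:
            w, _ = nnls(D, x)                    # non-negative weights for this frame
            for i in range(len(labels)):
                totals[i] += w[bounds[i]:bounds[i + 1]].sum()
        return labels[int(np.argmax(totals))]    # class whose atoms gather the most weight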
We have then dealt with noisy speech signals assuming a single speaker speaking in a noisy environment.
The noisy speech signals have been simulated at different SNRs using different noise and
speaker sources. We have classified the speaker and background noise class of the noisy speech signal
and subsequently separated the speech and noise components. Given a test noisy speech signal, a noise
label is assigned to a subset of frames selected using the SDR measure, and an accumulated measure is
used to classify the noise in the whole test signal. The speaker is classified using the proposed metric
of accumulated sum of weights on high-energy features, estimated using ASNA with L1 regularization
from the concatenation of the speaker dictionaries and the identified noise source dictionary. Using the
dictionaries of the identified speaker and noise source, we estimate the separated speech
and noise signals using ASNA with L1 regularization and supervised non-negative matrix factorization
(NMF). We obtain around 98% accuracy for noise classification and 89% for speaker classification at
an SNR of 10 dB for a combination of 30 speakers and 15 noise sources.
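The separation step with the identified dictionaries can be sketched along the following lines, again with non-negative least squares standing in for ASNA and supervised NMF, and with a Wiener-like soft mask assumed for reconstruction; the mixture phase would be reused to resynthesize the time-domain signals, and the names below are illustrative rather than those used in the thesis.

    import numpy as np
    from scipy.optimize import nnls

    def separate_speech_noise(mix_mag, D_speech, D_noise, eps=1e-12):
        """Separate a noisy-speech magnitude spectrogram using fixed class dictionaries.

        mix_mag  : (n_bins, n_frames) magnitude spectrogram of the noisy speech.
        D_speech : (n_bins, n_s) dictionary of the identified speaker.
        D_noise  : (n_bins, n_n) dictionary of the identified noise source.
        """
        D = np.hstack([D_speech, D_noise])
        n_s = D_speech.shape[1]
        speech_mag = np.zeros_like(mix_mag)
        noise_mag = np.zeros_like(mix_mag)
        for t, x in enumerate(mix_mag.T):
            w, _ = nnls(D, x)                        # weights over both dictionaries
            s_hat = D_speech @ w[:n_s]               # speech part of the model
            n_hat = D_noise @ w[n_s:]                # noise part of the model
            mask = s_hat / (s_hat + n_hat + eps)     # Wiener-like soft mask
            speech_mag[:, t] = mask * x
            noise_mag[:, t] = (1.0 - mask) * x
        return speech_mag, noise_mag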
In the case of an unknown noise, the noise source is estimated as the nearest known noise label.
The distribution of an unknown noise source amongst the known noise classes gives an indication
of the possible noise source. The dictionary corresponding to the estimated noise label is updated
adaptively using the features from the noise-only frames of the test signal. The updated dictionary
is then used for speaker classification, and subsequently separation is carried out. In the case of an
unknown speaker, the nearest speaker is estimated and the corresponding dictionary is updated using
a clean speech segment from the test signal, which we assume is available for adapting the speech
dictionary. We have observed an improvement in SDR after separation of the speech and noise
components using an adaptive dictionary. The adaptive noise dictionary gives an improvement of
about 18% in speaker classification accuracy and 4 dB in SDR over an out-of-set dictionary, after
enhancement of noisy speech at an SNR of 0 dB.
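The exact adaptation rule is specific to the thesis; as a loose sketch of the idea, the snippet below appends a few unit-norm atoms, drawn from the highest-energy noise-only frames of the test signal, to the dictionary of the nearest known noise class. The frame selection, the number of new atoms and the normalization are assumptions.

    import numpy as np

    def adapt_noise_dictionary(D_noise, noise_only_frames, n_new_atoms=8):
        """Append test-signal-specific atoms to the nearest known noise dictionary.

        D_noise           : (n_bins, n_atoms) dictionary of the estimated noise label.
        noise_only_frames : (n_bins, n_frames) magnitude spectra of noise-only frames.
        """
        energies = noise_only_frames.sum(axis=0)
        idx = np.argsort(energies)[-n_new_atoms:]               # strongest noise-only frames
        new_atoms = noise_only_frames[:, idx]
        new_atoms = new_atoms / (np.linalg.norm(new_atoms, axis=0, keepdims=True) + 1e-12)
        return np.hstack([D_noise, new_atoms])                  # adapted dictionary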
In the case of a conversation, a divide-and-conquer algorithm is proposed to recursively estimate
the noise sources, the approximate instant of noise transition and the number of noise types. We
have then experimented on a conversation simulated by concatenating two different noise signals,
each containing speech segments of distinct speakers, and obtained a mean absolute error of 10 ms
in the detection of the noise transition instant at -10 dB SNR. Each of the segments obtained from
the transition instant can be treated as a single noise mixed with speech from a single speaker, and
subsequent speaker classification and source separation can be done as in the previous case.
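The recursion itself can be pictured roughly as below, assuming a frame-level noise classifier is available; the splitting criterion, the minimum segment length and the merging of like-labelled neighbours are illustrative choices rather than the thesis algorithm.

    def detect_noise_segments(frames, classify_noise, min_len=20):
        """Recursively split a recording and classify the background noise in each part.

        frames         : (n_bins, n_frames) feature matrix of the noisy conversation.
        classify_noise : callable mapping a feature sub-matrix to a noise label.
        min_len        : smallest segment length (in frames) that is still split.
        Returns a list of (start, end, label) segments; the boundaries between
        segments with different labels are the estimated noise-transition instants.
        """
        def recurse(lo, hi):
            label = classify_noise(frames[:, lo:hi])
            if hi - lo <= min_len:
                return [(lo, hi, label)]
            mid = (lo + hi) // 2
            if classify_noise(frames[:, lo:mid]) == classify_noise(frames[:, mid:hi]) == label:
                return [(lo, hi, label)]           # homogeneous segment: stop splitting
            return recurse(lo, mid) + recurse(mid, hi)

        merged = []
        for lo, hi, label in recurse(0, frames.shape[1]):
            if merged and merged[-1][2] == label:  # merge neighbours with the same label
                merged[-1] = (merged[-1][0], hi, label)
            else:
                merged.append((lo, hi, label))
        return merged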
We have also addressed the classification and subsequent separation of speakers in overlapped
speech, obtaining a mean speaker classification accuracy of 84% for a speaker 1 to speaker 2 ratio
(S1S2R) of 0 dB.
The advantage of the proposed dictionary learning and sparse representation based approach is
that the training and classification model is independent of the selected classes of speakers and noises.
Dictionaries for new classes can easily be added, and old classes can be removed or replaced, without
retraining the entire model. Also, the same model can be used for identifying other types of classes,
such as language and gender. We have achieved speaker and noise classification and subsequent
separation using only spectral features for dictionary learning. This is in contrast to stochastic
model based approaches, where the model needs to be retrained whenever a new class is added.
In the unsupervised acoustic-phonetics knowledge based approach, we detect transitions between
broad phonetic classes in a speech signal, which has applications such as landmark detection and segmentation.
The proposed rule-based hierarchical method detects transitions from silence to non-silence,
and from sonorant to non-sonorant and vice versa. We exploit the relatively abrupt changes in the characteristics
of the speech signal to detect the transitions. Relative thresholds learnt from a small development set
are used to determine the parameter values. We propose different measures for detecting transitions
between broad phonetic classes in a speech signal based on abrupt amplitude changes. A measure
is defined on the quantized speech signal to detect transitions between very low amplitude or silence
(S) and non-silence (N) segments. The S-segments could be stop closures, pauses or silence regions
at the beginning and/or ending of an utterance. We propose two other measures to detect the transitions
between sonorant and non-sonorant segments and vice versa. These measures make use of the fact that
most sonorants have higher energy in the low frequencies than other phone classes such as unvoiced
fricatives, affricates and unvoiced stops. For this reason, we use a bandpass filtered speech signal (60-340 Hz)
for extracting temporal features. A subset of the extrema (minimum or maximum amplitude samples)
between every pair of successive zero-crossings and above a threshold is selected from each frame of
the bandpass filtered speech signal. If the speech signal belongs to a non-transition segment, the first
and the last of these extrema lie well before and well after the mid-point (reference) of the frame;
otherwise, one of these locations lies within a few samples of the reference, indicating a transition frame.
The advantage of this approach is that it does not require significant training data for determining
its parameters.
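A simplified sketch of this transition test, under assumed values for the amplitude threshold and the "few samples" guard around the frame mid-point, is given below; the thesis uses relative thresholds learnt from the development set and additional measures not shown here.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def bandpass_60_340(x, fs):
        """60-340 Hz bandpass filter used for extracting the temporal features."""
        b, a = butter(4, [60.0 / (fs / 2), 340.0 / (fs / 2)], btype="band")
        return filtfilt(b, a, x)

    def is_transition_frame(frame, threshold, guard=0.1):
        """Flag a frame of the bandpass filtered speech as a transition frame.

        One extremum is taken between every pair of successive zero-crossings, and
        only extrema whose magnitude exceeds `threshold` are kept.  If the first or
        the last kept extremum falls close to the frame mid-point (within `guard`
        times the frame length), the frame is flagged as a transition frame.
        """
        n = len(frame)
        signs = np.signbit(frame).astype(np.int8)
        zc = np.where(np.diff(signs) != 0)[0]                  # zero-crossing positions
        kept = []
        for a, b in zip(zc[:-1], zc[1:]):                      # one extremum per half-cycle
            seg = frame[a + 1:b + 1]
            if seg.size == 0:
                continue
            k = a + 1 + int(np.argmax(np.abs(seg)))
            if abs(frame[k]) > threshold:
                kept.append(k)
        if len(kept) < 2:
            return False                                       # too few strong extrema to decide
        mid = n // 2
        return min(abs(kept[0] - mid), abs(kept[-1] - mid)) < guard * n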
When tested on the entire TIMIT database for clean speech, 93.6% of the detected transitions
are within a tolerance of 20 ms of the hand-labeled boundaries. Sonorant, unvoiced non-sonorant
and silence classes and their respective onsets are detected with an accuracy of about 83.5% for the
same tolerance, using the labeled TIMIT database as reference. The results are as good as, and in
some respects better than, those of state-of-the-art methods for similar tasks. The proposed method is
also tested on the test set of the TIMIT database for robustness with respect to white, babble and
Schroeder noise, and about 90% of the transitions are detected within the tolerance of 20 ms at a
signal-to-noise ratio of 5 dB.
We have also estimated glottal closure instants (GCIs), which are useful for a variety of applications
such as pitch and duration modification, speaking rate modification, pitch normalization, speech
coding/compression, and speaker normalization. The instant at which the vocal tract is significantly excited
within each glottal cycle in a speech signal is referred to as the epoch or the GCI. Subband analysis
of the linear prediction residual (LPR) is proposed to estimate the GCIs from voiced speech segments. A
composite signal is derived as the sum of the envelopes of the subband components of the LPR signal.
Appropriately chosen peaks of the composite signal are the GCI candidates. The temporal locations
of the candidates are refined using the LPR to obtain the GCIs, which are validated against the GCIs
obtained from the simultaneously recorded electroglottograph signal. The robustness is studied using
additive white, pink, blue, babble, vehicle and HF channel noises at different signal-to-noise ratios,
and under reverberation. The proposed method is evaluated on six different databases and compared with
three state-of-the-art LPR-based methods. The GCI detection performance of the proposed algorithm
is quantified using the following measures: identification rate (IDR), miss rate (MR), false alarm rate
(FAR), standard deviation of error (SDE) and accuracy to 0.25 ms. We have shown that significant
GCI information exists in each subband of speech up to 2000 Hz, and an identification rate of at
least 89% (for subbands other than the lowpass band) can be obtained for clean speech using the
proposed method. The results show that the performance of the proposed method is comparable to
the best of the LPR-based techniques for both clean and noisy speech.
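A rough sketch of the subband-envelope idea is given below; it applies LP analysis over a whole voiced segment rather than frame-wise, assumes four subbands up to 2000 Hz, and uses simple peak picking in place of the candidate selection and LPR-based refinement of the thesis, so the band edges, filter orders and peak-picking parameters are all assumptions.

    import numpy as np
    from scipy.signal import butter, filtfilt, find_peaks, hilbert, lfilter

    def lpc_inverse_filter(x, order=12):
        """Autocorrelation-method LP inverse filter [1, -a1, ..., -ap] for a segment."""
        x = np.asarray(x, dtype=float)
        r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        a = np.linalg.solve(R + 1e-6 * np.eye(order), r[1:order + 1])
        return np.concatenate(([1.0], -a))

    def gci_candidates(voiced_segment, fs, n_bands=4, fmax=2000.0):
        """Return sample indices of GCI candidates from a voiced speech segment.

        The LP residual is split into a few subbands up to `fmax`, the Hilbert
        envelope of each subband is computed, and the envelopes are summed into
        a composite signal whose prominent peaks are taken as GCI candidates.
        """
        inv = lpc_inverse_filter(voiced_segment)
        residual = lfilter(inv, [1.0], voiced_segment)        # linear prediction residual (LPR)
        edges = np.linspace(50.0, fmax, n_bands + 1)          # assumed subband edges in Hz
        composite = np.zeros_like(residual)
        for lo, hi in zip(edges[:-1], edges[1:]):
            b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
            composite += np.abs(hilbert(filtfilt(b, a, residual)))  # subband envelope
        min_dist = int(0.002 * fs)                            # assume pitch period above 2 ms
        peaks, _ = find_peaks(composite, distance=min_dist, height=0.3 * composite.max())
        return peaks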