|dc.description.abstract||Improving speech recognition performance in the presence of noise and interference continues to be a challenging problem. Automatic Speech Recognition (ASR) systems work well when the test and training conditions match. In real world environments there is often a mismatch between testing and training conditions. Various factors like additive noise, acoustic echo, and speaker accent, affect the speech recognition performance. Since ASR is a statistical pattern recognition problem, if the test patterns are unlike anything used to train the models, errors are bound to occur, due to feature vector mismatch. Various approaches to robustness have been proposed in the ASR literature contributing to mainly two topics: (i) reducing the variability in the feature vectors or (ii) modify the statistical model parameters to suit the noisy condition. While some of those techniques are quite effective, we would like to examine robustness from a different perspective. Considering the analogy of human communication over telephones, it is quite common to ask the person speaking to us, to repeat certain portions of their speech, because we don't understand it. This happens more often in the presence of background noise where the intelligibility of speech is affected significantly. Although exact nature of how humans decode multiple repetitions of speech is not known, it is quite possible that we use the combined knowledge of the multiple utterances and decode the unclear part of speech. Majority of ASR algorithms do not address this issue, except in very specific issues such as pronunciation modeling. We recognize that under very high noise conditions or bursty error channels, such as in packet communication where packets get dropped, it would be beneficial to take the approach of repeated utterances for robust ASR. In this thesis, we have formulated a set of algorithms for both joint evaluation/decoding for recognizing noisy test utterances as well as utilize the same formulation for selective training of Hidden Markov Models (HMMs), again for robust performance.
We first address joint recognition of multiple speech patterns given that they belong to the same class. We formulated this problem considering the patterns as isolated words. If there are K test patterns (K ≥ 2) of a word by a speaker, we show that it is possible to improve the speech recognition accuracy over independent single pattern evaluation of test speech, for the case of both clean and noisy speech. We also find the state sequence which best represents the K patterns. This formulation can be extended to connected word recognition or continuous speech recognition also.
Next, we consider the benefits of joint multi-pattern likelihood for HMM training. In the usual HMM training, all the training data is utilized to arrive at a best possible parametric model. But, it is possible that the training data is not all genuine and therefore may have labeling errors, noise corruptions, or plain outlier exemplars. Such outliers will result in poorer models and affect speech recognition performance. So it is important to selectively train them so that the outliers get a lesser weightage. Giving lesser weight to an entire outlier pattern has been addressed before in speech recognition literature. However, it is possible that only some portions of a training pattern are corrupted. So it is important that only the corrupted portions of speech are given a lesser weight during HMM training and not the entire pattern. Since in HMM training, multiple patterns of speech from each class are used, we show that it is possible to use joint evaluation methods to selectively train HMMs such that only the corrupted portions of speech are given a lesser weight and not the entire speech pattern.
Thus, we have addressed all the three main tasks of a HMM, to jointly utilize the availability of multiple patterns belonging to the same class. We experimented the new algorithms for Isolated Word Recognition in the case of both clean speech and noisy speech. Significant improvement in speech recognition performance is obtained, especially for speech affected by transient/burst noise.||en