Strategies for Handling Large Vocabulary and Data Sparsity Problems for Tamil Speech Recognition
Abstract
This thesis focuses on the design and development of every building block of a large-vocabulary
continuous speech recognition (LVCSR) system for Tamil and on the various experiments conducted
to enhance its performance. To the best of our knowledge, this is the first reported development
of such a full-fledged Tamil LVCSR system: we could not find any journal or refereed
conference paper that builds a Tamil LVCSR system and reports recognition results on large-scale,
open-source speech recognition test datasets. A large read-speech corpus of 217 hours has
been collected and annotated at the sentence level for the development of the LVCSR system.
Of this, 160 hours of data are used for training the LVCSR system, 50 hours serve as the test
set, and the publicly available 7 hours of OpenSLR-65 data released by Google are used as the
development set.
The major contributions of the thesis are:
• Collection of a large amount of Tamil speech and text data, editing the transcriptions to
match the spoken utterances, and using them to develop a deep neural network (DNN) and
graphical-model-based Tamil LVCSR system.
• Handling the unlimited vocabulary problem in Tamil by proposing a subword modeling
technique with novel subword-dictionary creation and word segmentation methods,
implemented efficiently in the weighted finite-state transducer (WFST) framework.
• Addressing the data sparsity problem by pooling data from multiple low- and medium-resource
languages using novel phone/senone mapping techniques and by training a
multitask DNN (MT-DNN).
• Proposing a novel coactivation loss for speaker adaptation of the DNN, derived through
asymptotic Bayesian approximation (via the Laplace approximation) and computed from the mean
and covariance statistics of the activations at all hidden layers of the speaker-independent DNN.
• Studying the use of scattering transform features in acoustic modeling and proposing a DNN architecture inspired by them that jointly performs feature extraction and acoustic
modeling from the raw speech signal.
The first part of the thesis addresses the key problem of unlimited vocabulary in Tamil
and presents effective subword-modeling techniques to tackle it. The rich morphology,
agglutination, and inflection of the Tamil language give rise to an effectively unlimited
vocabulary. A graphical-model-based speech recognition system requires a finite vocabulary;
however, even the most commonly used words in Tamil cannot be contained within a finite set.
We propose and implement various techniques based on maximum-likelihood estimation (using the
expectation-maximization algorithm) and Viterbi estimation to segment each word into its
constituent subwords and use them in our recognition framework. We have also used
morphological analyzers and manually created graphical models for word segmentation. Using
these subword sequences, we construct lexicon and grammar graphs in such a way that the
unlimited vocabulary problem is addressed to a large extent, with only a limited amount of
post-processing at the output stage of the recognition engine. The proposed methods are
highly effective, reducing the out-of-vocabulary (OOV) rate on the test corpus from
10.73% to 1.68% and the word error rate (WER) from 24.7% to 12.31%.
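To illustrate the Viterbi estimation step, the following is a minimal sketch of maximum-likelihood word segmentation with a unigram subword inventory. The actual system implements this within the WFST framework; the subword_logprob table below is a hypothetical placeholder for probabilities that would be estimated with expectation-maximization over the training text.

import math

# Hypothetical unigram log-probabilities for the subword inventory; the
# entries below are placeholders for illustration only.
subword_logprob = {"un": -2.3, "limited": -4.1, "vocab": -3.5, "ulary": -3.9}

def viterbi_segment(word, logprob, max_len=10):
    """Segment `word` into its maximum-likelihood subword sequence."""
    n = len(word)
    best = [-math.inf] * (n + 1)   # best log-probability of word[:i]
    back = [0] * (n + 1)           # start index of the last subword
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = word[j:i]
            if piece in logprob and best[j] + logprob[piece] > best[i]:
                best[i] = best[j] + logprob[piece]
                back[i] = j
    if best[n] == -math.inf:       # no segmentation found: keep whole word
        return [word]
    pieces, i = [], n
    while i > 0:                   # follow backpointers to recover pieces
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]

print(viterbi_segment("unlimited", subword_logprob))  # -> ['un', 'limited']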
In the second part of the thesis, we address the data sparsity problem through multilingual
training of the DNN-based acoustic model, leveraging acoustic information from the transcribed
speech corpora of other languages. These experiments are carried out in a limited-resource
setting, for which we propose two techniques. In the first, data pooling
with phone/senone mapping, we train the DNN acoustic model with features and
senones pooled from the target as well as the source languages. Since the
phone sets of the source and target languages differ, we map the source-language phone set
to that of the target language before training the network. In the second approach,
an MT-DNN is trained with features from both the source and target languages to
predict the senones of each language in separate output layers. We modify the cross-entropy
loss function such that the layers just before an output layer, which are responsible
for predicting the senones of one particular language, are updated only when the feature vector
belongs to that language. The data pooling with DNN-based senone mapping and the MT-DNN
methods give relative improvements of 9.66% and 13.94% over the baseline system, respectively.
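A minimal sketch of the MT-DNN idea, assuming a PyTorch implementation with two languages; the layer sizes, senone counts, and the exact shared/language-specific split are illustrative, not the thesis configuration.

import torch
import torch.nn as nn

class MTDNN(nn.Module):
    """Shared hidden layers with one senone output head per language."""
    def __init__(self, feat_dim, hidden_dim, senones_per_lang):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # One output head per language; a head is reached (and hence
        # updated) only by frames of its own language.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, n) for n in senones_per_lang]
        )

    def forward(self, feats, lang_id):
        return self.heads[lang_id](self.shared(feats))

# Illustrative sizes: 440-dim spliced features, senone counts per language.
model = MTDNN(feat_dim=440, hidden_dim=1024, senones_per_lang=[3000, 2500])
ce = nn.CrossEntropyLoss()

def training_step(feats, senone_targets, lang_id):
    # Computing the cross-entropy only at the matching language's output
    # layer means the other heads receive no gradient for this batch,
    # while the shared layers are updated by data from every language.
    logits = model(feats, lang_id)
    return ce(logits, senone_targets)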
The third part of the thesis deals with the speaker adaptation challenge in a supervised
manner. We propose techniques based on asymptotic Bayesian approximation to derive
a loss function and use it to adapt the speaker-independent DNN (trained on data from multiple
speakers) with a limited amount of speaker-specific adaptation data. The proposed loss function
retains the advantages of the cross-entropy loss and can also be viewed as a generalization of
the well-known center loss. The speaker-adapted models thus obtained for individual
speakers reduce the WER from 17.33% to 13.97%.
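To make the connection to the center loss concrete, one plausible form of such an adaptation objective (the exact formulation is given in the thesis body; the notation here is illustrative) regularizes the hidden activations of the adapted network toward the Gaussian statistics collected from the speaker-independent network:

\[
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}} \;+\; \lambda \sum_{l=1}^{L} \bigl(\mathbf{h}_l - \boldsymbol{\mu}_l\bigr)^{\top} \boldsymbol{\Sigma}_l^{-1} \bigl(\mathbf{h}_l - \boldsymbol{\mu}_l\bigr),
\]

where \(\mathbf{h}_l\) is the activation vector at hidden layer \(l\) of the network being adapted, and \(\boldsymbol{\mu}_l\) and \(\boldsymbol{\Sigma}_l\) are the mean and covariance of the activations at that layer under the speaker-independent DNN. Setting \(\boldsymbol{\Sigma}_l = \mathbf{I}\) at a single layer recovers the squared-distance penalty of the center loss.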
In the final part of the thesis, we study the effect of using the scattering transform in the
feature extraction stage of our automatic speech recognition (ASR) system. Our experiments show
that scattering-transform-based features perform better than traditional features such as
mel-frequency cepstral coefficients (MFCC) and log filterbank energies (LFBE). Motivated by
this, we propose different DNN architectures that use one-dimensional (1-D) and two-dimensional
(2-D) convolution layers to predict the senones directly from the raw speech waveform. The
convolution layers are initialized with 1-D and 2-D Gabor filterbank coefficients so that the
intermediate layers of the DNN learn a representation similar to that of
scattering-transform-based features. The proposed DNN architecture in which the convolution
layers are initialized with Gabor filter coefficients and updated during training gives the
best WER of 12.21%, a relative improvement of 9.42% over the baseline DNN trained on LFBE
features.
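As an illustration of the Gabor initialization, the following is a minimal PyTorch sketch for the 1-D case; the kernel length, stride, bandwidths, and log-spaced center frequencies are assumptions for illustration, not the thesis configuration.

import math
import torch
import torch.nn as nn

def gabor_kernel(center_freq, bandwidth, kernel_size, sample_rate=16000.0):
    """Real 1-D Gabor filter: a sinusoid under a Gaussian envelope."""
    t = (torch.arange(kernel_size, dtype=torch.float32)
         - kernel_size // 2) / sample_rate
    sigma = 1.0 / (2.0 * math.pi * bandwidth)
    envelope = torch.exp(-(t ** 2) / (2.0 * sigma ** 2))
    return envelope * torch.cos(2.0 * math.pi * center_freq * t)

num_filters, kernel_size = 40, 401          # illustrative sizes
conv = nn.Conv1d(1, num_filters, kernel_size, stride=160, bias=False)

# Initialize each channel with a Gabor filter on a log-spaced frequency
# grid (40 Hz to 8 kHz, chosen here for illustration); the weights remain
# trainable, so the filters can still be updated during training as in the
# best-performing configuration.
with torch.no_grad():
    for i in range(num_filters):
        f = 40.0 * (8000.0 / 40.0) ** (i / (num_filters - 1))
        conv.weight[i, 0] = gabor_kernel(f, bandwidth=f / 8.0,
                                         kernel_size=kernel_size)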