Strategies for Handling Large Vocabulary and Data Sparsity Problems for Tamil Speech Recognition
Abstract
This thesis focuses on the design and development of every building block of a large-vocabulary
continuous speech recognition (LVCSR) system for Tamil and on the various experiments conducted
to enhance its performance. To the best of our knowledge, this is the first reported development
of such a full-fledged Tamil LVCSR system: we could not find any journal or refereed
conference paper that builds a Tamil LVCSR system and reports recognition results on large-scale,
open-source speech recognition test datasets. A large read-speech corpus of 217 hours has
been collected and annotated at the sentence level for the development of the LVCSR system.
Of this, 160 hours of data are used for training the LVCSR system, 50 hours serve as the test
set, and the publicly available 7 hours of OpenSLR-65 data released by Google are used as the
development set.
The major contributions of the thesis are:
• Collection of a large amount of Tamil speech and text data, editing the transcriptions to
match the spoken utterances, and using them to develop a deep neural network (DNN) and
graphical-model-based Tamil LVCSR system.
• Handling the unlimited vocabulary problem in Tamil by proposing a subword modeling
technique with novel subword-dictionary creation and word segmentation methods,
implemented efficiently in the weighted finite-state transducer (WFST) framework.
• Addressing the data sparsity problem by pooling data from multiple low- and medium-resource
languages using novel phone/senone mapping techniques and by training a
multitask DNN (MT-DNN).
• Proposing a novel coactivation loss for speaker adaptation of the DNN, derived through
asymptotic Bayesian approximation (via the Laplace approximation) and computed from the mean
and covariance statistics of the activations at all hidden layers of the speaker-independent DNN.
• Studying the use of scattering transform features in acoustic modeling and proposing a DNN architecture inspired by them that jointly performs feature extraction and acoustic
modeling from the raw speech signal.
The first part of the thesis addresses the key problem of unlimited vocabulary in Tamil
and presents effective subword-modeling techniques to tackle it. The rich morphology,
agglutination, and inflection of the Tamil language give rise to an effectively unlimited
vocabulary. A graphical-model-based speech recognition system requires a finite vocabulary;
however, even the most commonly used words in Tamil cannot be contained within a finite set.
We propose and implement various techniques based on maximum-likelihood estimation (using the
expectation-maximization algorithm) and Viterbi estimation to segment each word into its
constituent subwords and use them in our recognition framework. We have also used
morphological analyzers and manually created graphical models for word segmentation. Using
these subword sequences, we construct lexicon and grammar graphs in such a way that the
unlimited vocabulary problem is addressed to a large extent, with only a limited amount of
post-processing at the output stage of the recognition engine. The proposed methods are
highly effective, reducing the out-of-vocabulary (OOV) rate on the test corpus from
10.73% to 1.68% and the word error rate (WER) from 24.7% to 12.31%.
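To illustrate the Viterbi estimation step, the following is a minimal sketch of maximum-likelihood word segmentation with a unigram subword inventory. The actual system implements this within the WFST framework; the subword_logprob table below is a hypothetical placeholder for probabilities that would be estimated with expectation-maximization over the training text.

import math

# Hypothetical unigram log-probabilities for the subword inventory; the
# entries below are placeholders for illustration only.
subword_logprob = {"un": -2.3, "limited": -4.1, "vocab": -3.5, "ulary": -3.9}

def viterbi_segment(word, logprob, max_len=10):
    """Segment `word` into its maximum-likelihood subword sequence."""
    n = len(word)
    best = [-math.inf] * (n + 1)   # best log-probability of word[:i]
    back = [0] * (n + 1)           # start index of the last subword
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = word[j:i]
            if piece in logprob and best[j] + logprob[piece] > best[i]:
                best[i] = best[j] + logprob[piece]
                back[i] = j
    if best[n] == -math.inf:       # no segmentation found: keep whole word
        return [word]
    pieces, i = [], n
    while i > 0:                   # follow backpointers to recover pieces
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]

print(viterbi_segment("unlimited", subword_logprob))  # -> ['un', 'limited']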
In the second part of the thesis, we address the data sparsity problem through multilingual
training of the DNN-based acoustic model, leveraging acoustic information from the transcribed
speech corpora of other languages. These experiments are carried out in a limited-resource
setting, for which we propose two techniques. In the first, data pooling
with phone/senone mapping, we train the DNN acoustic model with features and
senones pooled from the target as well as the source languages. Since the
phone sets of the source and target languages differ, we map the source-language phone set
to that of the target language before training the network. In the second approach,
an MT-DNN is trained with features from both the source and target languages to
predict the senones of each language in separate output layers. We modify the cross-entropy
loss function such that the layers just before an output layer, which are responsible
for predicting the senones of one particular language, are updated only when the feature vector
belongs to that language. The data pooling with DNN-based senone mapping and the MT-DNN
methods give relative improvements of 9.66% and 13.94% over the baseline system, respectively.
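A minimal sketch of the MT-DNN idea, assuming a PyTorch implementation with two languages; the layer sizes, senone counts, and the exact shared/language-specific split are illustrative, not the thesis configuration.

import torch
import torch.nn as nn

class MTDNN(nn.Module):
    """Shared hidden layers with one senone output head per language."""
    def __init__(self, feat_dim, hidden_dim, senones_per_lang):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # One output head per language; a head is reached (and hence
        # updated) only by frames of its own language.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, n) for n in senones_per_lang]
        )

    def forward(self, feats, lang_id):
        return self.heads[lang_id](self.shared(feats))

# Illustrative sizes: 440-dim spliced features, senone counts per language.
model = MTDNN(feat_dim=440, hidden_dim=1024, senones_per_lang=[3000, 2500])
ce = nn.CrossEntropyLoss()

def training_step(feats, senone_targets, lang_id):
    # Computing the cross-entropy only at the matching language's output
    # layer means the other heads receive no gradient for this batch,
    # while the shared layers are updated by data from every language.
    logits = model(feats, lang_id)
    return ce(logits, senone_targets)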
The third part of the thesis deals with the speaker adaptation challenge in a supervised
manner. We propose techniques based on asymptotic Bayesian approximation to derive
a loss function and use it to adapt the speaker-independent DNN (trained on data from multiple
speakers) with a limited amount of speaker-specific adaptation data. The proposed loss function
retains the advantages of the cross-entropy loss and can also be viewed as a generalization of
the well-known center loss. The speaker-adapted models thus obtained for individual
speakers reduce the WER from 17.33% to 13.97%.
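To make the connection to the center loss concrete, one plausible form of such an adaptation objective (the exact formulation is given in the thesis body; the notation here is illustrative) regularizes the hidden activations of the adapted network toward the Gaussian statistics collected from the speaker-independent network:

\[
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}} \;+\; \lambda \sum_{l=1}^{L} \bigl(\mathbf{h}_l - \boldsymbol{\mu}_l\bigr)^{\top} \boldsymbol{\Sigma}_l^{-1} \bigl(\mathbf{h}_l - \boldsymbol{\mu}_l\bigr),
\]

where \(\mathbf{h}_l\) is the activation vector at hidden layer \(l\) of the network being adapted, and \(\boldsymbol{\mu}_l\) and \(\boldsymbol{\Sigma}_l\) are the mean and covariance of the activations at that layer under the speaker-independent DNN. Setting \(\boldsymbol{\Sigma}_l = \mathbf{I}\) at a single layer recovers the squared-distance penalty of the center loss.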
In the final part of the thesis, we study the effect of using the scattering transform in the
feature extraction stage of our automatic speech recognition (ASR) system. Our experiments show
that scattering-transform-based features perform better than traditional features such as
mel-frequency cepstral coefficients (MFCC) and log filterbank energies (LFBE). Motivated by
this, we propose different DNN architectures that use one-dimensional (1-D) and two-dimensional
(2-D) convolution layers to predict the senones directly from the raw speech waveform. The
convolution layers are initialized with 1-D and 2-D Gabor filterbank coefficients so that the
intermediate layers of the DNN learn a representation similar to that of
scattering-transform-based features. The proposed DNN architecture in which the convolution
layers are initialized with Gabor filter coefficients and updated during training gives the
best WER of 12.21%, a relative improvement of 9.42% over the baseline DNN trained on LFBE
features.
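As an illustration of the Gabor initialization, the following is a minimal PyTorch sketch for the 1-D case; the kernel length, stride, bandwidths, and log-spaced center frequencies are assumptions for illustration, not the thesis configuration.

import math
import torch
import torch.nn as nn

def gabor_kernel(center_freq, bandwidth, kernel_size, sample_rate=16000.0):
    """Real 1-D Gabor filter: a sinusoid under a Gaussian envelope."""
    t = (torch.arange(kernel_size, dtype=torch.float32)
         - kernel_size // 2) / sample_rate
    sigma = 1.0 / (2.0 * math.pi * bandwidth)
    envelope = torch.exp(-(t ** 2) / (2.0 * sigma ** 2))
    return envelope * torch.cos(2.0 * math.pi * center_freq * t)

num_filters, kernel_size = 40, 401          # illustrative sizes
conv = nn.Conv1d(1, num_filters, kernel_size, stride=160, bias=False)

# Initialize each channel with a Gabor filter on a log-spaced frequency
# grid (40 Hz to 8 kHz, chosen here for illustration); the weights remain
# trainable, so the filters can still be updated during training as in the
# best-performing configuration.
with torch.no_grad():
    for i in range(num_filters):
        f = 40.0 * (8000.0 / 40.0) ** (i / (num_filters - 1))
        conv.weight[i, 0] = gabor_kernel(f, bandwidth=f / 8.0,
                                         kernel_size=kernel_size)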