Automatic speech recognition for low-resource Indian languages
Abstract
Building good models for automatic speech recognition (ASR) requires large amounts of annotated speech data. Recent advances in end-to-end speech recognition have further increased this need for data. However, most Indian languages are low-resource and lack enough training data to build robust and efficient ASR systems. Despite the challenges posed by data scarcity, Indian languages offer some unique characteristics that can be exploited to improve speech recognition in low-resource settings. Most of these languages have largely overlapping phoneme sets and a strong correspondence between their character sets and pronunciations. Though the writing systems differ, the Unicode tables are organized so that similar-sounding characters occur at the same offset within the range assigned to each language.
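This offset property can be checked directly against the Unicode code charts. A minimal sketch in Python (the block-start values below come from the Unicode standard; the scripts chosen are just a sample):

```python
# Unicode block starts for a few Indic scripts (from the Unicode code charts).
BLOCK_START = {
    "Devanagari": 0x0900,
    "Bengali":    0x0980,
    "Tamil":      0x0B80,
    "Kannada":    0x0C80,
    "Malayalam":  0x0D00,
}

def offset(ch, script):
    """Position of a character within its script's Unicode block."""
    return ord(ch) - BLOCK_START[script]

# The consonant KA sits at the same offset (0x15) in each block:
print(hex(offset("क", "Devanagari")))  # Devanagari KA -> 0x15
print(hex(offset("க", "Tamil")))       # Tamil KA      -> 0x15
print(hex(offset("ಕ", "Kannada")))     # Kannada KA    -> 0x15
```

Not every slot is filled in every script (Tamil, for instance, lacks separate aspirate letters), but the parallel layout makes cross-script character correspondences mechanical to compute.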
In the first part of the thesis, we exploit the pronunciation similarities among multiple Indian languages by using a shared set of pronunciation-based tokens. We evaluate ASR performance for four choices of tokens: Epitran, the Indian language speech sound label set (ILSL12), the Sanskrit phonetic library encoding (SLP1), and SLP1-M (SLP1 modified to include some contextual pronunciation rules). Using Sanskrit as a representative Indian language, we conduct monolingual experiments to evaluate their ASR performance. Conventional Gaussian mixture model-hidden Markov model (GMM-HMM) approaches and neural network models leveraging the alignments from the conventional models benefit from the stringent pronunciation modeling in SLP1-M. However, end-to-end (E2E) trained time-delay neural networks (TDNNs) yield the best results with SLP1.
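Because of the shared Unicode offsets, characters from different scripts can be collapsed onto a single pronunciation-based label. The sketch below covers only the velar stop row as an illustration; it is not the thesis's full mapping, and a complete system would handle vowels, all consonant rows, and script-specific gaps:

```python
# SLP1 labels for the velar consonant row, indexed by in-block Unicode offset.
# Illustrative subset only -- not the complete mapping used in the thesis.
SLP1_BY_OFFSET = {0x15: "k", 0x16: "K", 0x17: "g", 0x18: "G", 0x19: "N"}

BLOCK_START = {"Devanagari": 0x0900, "Kannada": 0x0C80}

def to_slp1(ch, script):
    """Map a native-script character to its shared SLP1 label."""
    return SLP1_BY_OFFSET[ord(ch) - BLOCK_START[script]]

# The same pronunciation-based label is recovered from either script,
# so pooling multilingual data does not duplicate equivalent characters:
print(to_slp1("क", "Devanagari"), to_slp1("ಕ", "Kannada"))  # k k
```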
Most Indian languages are spoken in units of syllables. However, to the best of our knowledge, syllables have never been used for E2E speech recognition in Indian languages. We therefore compare token units such as native-script characters, SLP1, and syllables in monolingual settings for multiple Indian languages. We also evaluate the performance of sub-word units generated from these basic units with the byte pair encoding (BPE) and unigram language model (ULM) algorithms. We find that syllable-based sub-word units are promising alternatives to graphemes in monolingual speech recognition, provided the dataset fairly covers the syllables of the language. The benefits of syllable sub-words in E2E speech recognition may be attributed to the reduced effective length of the token sequences. We also investigate whether models trained on different token units can complement each other in a pretraining and fine-tuning setup. However, the performance improvements in such a setup with syllable-BPE and SLP1 character tokens are minor compared to the model trained only on syllable-BPE. We further examine the suitability of syllable-based units in a cross-lingual training setup for a low-resource target language, but the model faces convergence issues. SLP1 characters are a better choice than syllable sub-words in cross-lingual transfer learning.
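The shorter token sequences come from grouping characters into orthographic syllables (aksharas). A simplified Devanagari-only splitter, not the exact segmentation used in the thesis, can be sketched as follows: a new unit starts at any independent character, unless the previous unit ends in a virama, which glues consonants into a cluster:

```python
# Simplified akshara (orthographic syllable) splitter for Devanagari.
# A minimal sketch, not the thesis's syllabifier.
VIRAMA = "\u094D"

def is_dependent(ch):
    """Combining signs (anusvara etc.), vowel matras, and the virama."""
    code = ord(ch)
    return 0x0900 <= code <= 0x0903 or 0x093E <= code <= 0x094C or ch == VIRAMA

def aksharas(text):
    units = []
    for ch in text:
        # attach dependents to the current unit; a trailing virama also
        # pulls the next consonant into the same cluster
        if units and (is_dependent(ch) or units[-1].endswith(VIRAMA)):
            units[-1] += ch
        else:
            units.append(ch)
    return units

print(aksharas("नमस्ते"))  # ['न', 'म', 'स्ते'] -- 3 units vs 6 characters
```

BPE or ULM segmentation is then run on top of these units instead of raw characters, which is what shortens the effective token sequences the model must emit.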
In the first part, we also verify the effectiveness of SpecAugment in an extremely low-resource setting. We apply SpecAugment to the log-mel spectrogram for data augmentation on a limited dataset of just 5.5 hours. The assumption is that the target language has no closely related high-resource source language and that only very limited data is available. SpecAugment provides an absolute improvement of 13.86% in word error rate (WER) on a connectionist temporal classification (CTC) based E2E system with weighted finite-state transducer (WFST) decoding. Based on this result, we use SpecAugment extensively in our experiments with E2E models.
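SpecAugment's frequency- and time-masking steps can be sketched in a few lines of NumPy. The mask counts and widths below are illustrative defaults, not the settings used in the thesis (which may also include time warping):

```python
import numpy as np

def spec_augment(logmel, n_freq_masks=2, F=8, n_time_masks=2, T=20, seed=0):
    """Zero out random frequency bands and time spans of a log-mel spectrogram.

    logmel: array of shape (num_mel_bins, num_frames).
    Parameter values are illustrative, not the thesis's configuration.
    """
    rng = np.random.default_rng(seed)
    out = logmel.copy()
    n_mels, n_frames = out.shape
    for _ in range(n_freq_masks):
        f = int(rng.integers(0, F + 1))             # mask width in mel bins
        f0 = int(rng.integers(0, n_mels - f + 1))   # mask start
        out[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = int(rng.integers(0, T + 1))             # mask width in frames
        t0 = int(rng.integers(0, n_frames - t + 1))
        out[:, t0:t0 + t] = 0.0
    return out
```

Each training epoch sees a differently masked version of the same utterance, which acts as cheap data augmentation when only a few hours of speech are available.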
In the second part of the thesis, we address strategies for improving the performance of ASR systems in low-resource scenarios (target languages) by exploiting annotated data from high-resource languages (source languages). Based on the results in the first part of the thesis, we use SLP1 tokens extensively in multilingual experiments on E2E networks. We specifically explore the following settings:
(a) No labeled audio data is available in the target language; only a limited amount of unlabeled data is available. We propose unsupervised domain adaptation (UDA) approaches in a hybrid deep neural network (DNN)-HMM setting to build ASR systems for low-resource languages sharing a common acoustic space with high-resource languages. We explore two architectures: (i) domain adversarial training using a gradient reversal layer (GRL) and (ii) the domain separation network (DSN). The GRL and DSN architectures give absolute improvements of 6.71% and 7.32%, respectively, in WER over the baseline DNN, with Hindi as the source domain and Sanskrit as the target domain. We also find that a judicious selection of the source language yields further improvements.
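The gradient reversal layer is an identity map in the forward pass that flips (and scales) gradients in the backward pass, so the shared feature extractor is pushed to make the two domains indistinguishable to the domain classifier. A framework-free sketch (the class and variable names are ours, not from the thesis):

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lamb in backward."""
    def __init__(self, lamb=1.0):
        self.lamb = lamb

    def forward(self, x):
        return x

    def backward(self, upstream_grad):
        return -self.lamb * upstream_grad

# Toy chain: shared features h feed a linear domain classifier d = v . h.
rng = np.random.default_rng(0)
h = rng.normal(size=4)        # shared features
v = rng.normal(size=4)        # domain-classifier weights
grl = GradientReversal(lamb=0.5)

d = v @ grl.forward(h)        # forward pass is unchanged
grad_wrt_h = grl.backward(v)  # d(d)/dh = v, but reversed before reaching the
                              # feature extractor, which thus *maximizes*
                              # the domain-classification loss
```

In a full system the senone-classification branch backpropagates normally, so the features stay discriminative for ASR while becoming domain-invariant.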
(b) The target language has only a small amount of labeled data but some text data for building language models. We exploit the available data in high-resource languages through a common shared label set to build unified acoustic models (AMs) and language models (LMs). We study and compare the performance of these unified models with that of monolingual models in low-resource conditions. The unified language-agnostic AM + LM outperforms the monolingual AM + LM when (a) only limited speech data is available for training the acoustic models, or (b) the test speech comes from domains different from those seen in training. A multilingual AM combined with a monolingual LM performs best overall. However, the results suggest that applying unified models directly to unseen languages, without fine-tuning, is not a good choice.
(c) There are N target languages with limited training data and several source languages with large training sets. We explore the usefulness of model-agnostic meta-learning (MAML) pretraining for Indian languages and study the importance of source-language selection. We find that MAML beats joint multilingual pretraining by an average of 5.4% in character error rate (CER) and 20.3% in WER with just five epochs of fine-tuning. Moreover, MAML achieves performance similar to joint multilingual training with just 25% of the training data. Similarity with the source languages impacts the target language's ASR performance. We propose a text-similarity-based loss-weighting scheme to exploit this effect and find absolute improvements of 1% (on average) in WER with it.
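MAML's nested optimization can be sketched with its first-order variant on a linear toy model; the thesis applies the idea to E2E ASR networks, and all names and values below are illustrative:

```python
import numpy as np

def loss_and_grad(w, X, y):
    """Mean squared error and its gradient for a linear model X @ w."""
    err = X @ w - y
    return 0.5 * np.mean(err ** 2), X.T @ err / len(y)

def maml_step(w, tasks, inner_lr=0.05, outer_lr=0.1):
    """One first-order MAML meta-update over a batch of (X, y) tasks."""
    meta_grad = np.zeros_like(w)
    for X, y in tasks:
        _, g = loss_and_grad(w, X, y)
        w_task = w - inner_lr * g                # inner, task-specific step
        _, g_post = loss_and_grad(w_task, X, y)  # gradient after adaptation
        meta_grad += g_post                      # first-order approximation
    return w - outer_lr * meta_grad / len(tasks)

# Toy "languages": related linear tasks scattered around a shared solution.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
tasks = [(X, X @ (w_true + 0.1 * rng.normal(size=2)))
         for X in (rng.normal(size=(20, 2)) for _ in range(3))]

w = np.zeros(2)
for _ in range(100):
    w = maml_step(w, tasks)
# w is now an initialization from which each task adapts in a few steps
```

The loss-weighting scheme replaces the uniform average of per-task gradients with weights derived from the text similarity between each source language and the target, so related source languages contribute more to the meta-update.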
The main contributions of the thesis are:
1. Finding that using SLP1 tokens as a common label set for Indian languages removes the redundancy involved in pooling characters from multiple languages.
2. Exploring, for the first time to the best of our knowledge, syllable-based token units for E2E speech recognition in Indian languages. We find that they are suitable only for monolingual ASR systems.
3. Formulating, for the first time, ASR for a low-resource language lacking labeled data as an unsupervised domain adaptation problem from a related high-resource language.
4. Exploring, for the first time, unified acoustic and language models in multilingual ASR for Indian languages. The scheme has shown success when the data for acoustic modeling is limited and when the test data is out-of-domain.
5. Proposing a textual-similarity-based loss-weighting scheme for MAML pretraining that improves the performance over vanilla MAML models.