dc.description.abstract | In the age of artificial intelligence, it is important for machines to automatically determine who is speaking and in what language - a task humans perform naturally. Developing algorithms that automatically infer the speaker, language, or accent from a given speech segment is a challenging problem that has been researched for over three decades. The main aim of this doctoral research was to explore and understand the shortcomings of existing approaches to these problems and to propose novel supervised approaches that overcome them, leading to robust speaker and language recognition systems.
In the first part of this doctoral research, we developed a supervised version of a popular embedding extraction approach, the i-vector, typically used as a front-end embedding for speaker and language recognition. In this approach, a database of speech recordings (each represented as a sequence of short-term feature vectors) is modeled with a Gaussian mixture model called the universal background model (GMM-UBM). The deviation in the mean components is captured in a lower-dimensional latent space, the i-vector space, using a factor analysis framework. We proposed a fully supervised version of the i-vector model, termed the s-vector, in which each label class is associated with a Gaussian prior having a class-specific mean parameter. The prior on the latent variable, marginalized over the sample space of classes, is then a GMM. This choice of prior distribution is motivated by the Gaussian back-end, where the conventional i-vectors of each language are modeled with a single Gaussian distribution. Through detailed data analysis and visualization, we show that the s-vector features yield representations that succinctly capture the language (accent) label information and distinguish the various accents of the same language significantly better than i-vectors. We performed language recognition experiments on the NIST Language Recognition Evaluation (LRE) 2017 challenge dataset, whose test segments range from 3 to 30 seconds. With the s-vector framework, we observed relative improvements over the conventional i-vector framework of 8% to 20% in the Bayesian detection cost function, 4% to 24% in equal error rate (EER), and 9% to 18% in classification accuracy. Language recognition experiments on the RATS and Mozilla CommonVoice datasets and speaker classification experiments on LibriSpeech demonstrated similar improvements.
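To make the formulation concrete, the following is a minimal LaTeX sketch in our own notation, reconstructed from the description above under the standard total-variability assumptions; the symbols m, T, w, and the class means are illustrative, and the identity prior covariance is assumed here for simplicity rather than taken from the thesis.

\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Conventional i-vector model: the utterance-dependent GMM mean
% supervector is a low-rank shift of the UBM mean supervector.
\begin{equation}
  \mathbf{M} = \mathbf{m} + \mathbf{T}\,\mathbf{w}, \qquad
  \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
\end{equation}
where $\mathbf{m}$ is the UBM mean supervector, $\mathbf{T}$ is the
low-rank total-variability matrix, and $\mathbf{w}$ is the latent
i-vector. In the supervised (s-vector) variant, each of the $K$ label
classes carries a Gaussian prior with a class-specific mean, so the
prior marginalized over the classes becomes a GMM:
\begin{equation}
  p(\mathbf{w}) = \sum_{k=1}^{K} \pi_k\,
  \mathcal{N}\!\left(\mathbf{w};\, \boldsymbol{\mu}_k, \mathbf{I}\right),
\end{equation}
with class probabilities $\pi_k$ and class means $\boldsymbol{\mu}_k$.
\end{document}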
In the second part, we explored the problem of speaker verification, where a binary decision must be made as to whether a test speech segment was spoken by a target speaker, based on a limited duration of enrollment speech. We proposed a neural network approach to back-end modeling, the neural probabilistic linear discriminant analysis (NPLDA) model. The likelihood ratio score of the generative PLDA model is posed as a discriminative similarity function, and the learnable parameters of this score function are optimized using a verification cost, proposed as a smooth approximation of the minimum detection cost function (DCF). Speaker recognition experiments with the NPLDA model were performed on the verification tasks of the VOiCES and Speakers in the Wild (SITW) challenge datasets. Further, we explored a fully neural approach in which the model outputs the verification score directly from the acoustic feature inputs. This Siamese neural network (E2E-NPLDA) model combines the embedding extraction and back-end modeling stages into a single processing pipeline, allowing the joint optimization of all the modules using the verification cost. We provide a detailed analysis of the influence of hyper-parameters, the choice of loss functions, and data sampling strategies for training these models. In extensive speaker recognition experiments on the SITW, VOiCES, and NIST SRE datasets, the proposed NPLDA and E2E-NPLDA models are shown to improve significantly over the state-of-the-art x-vector PLDA baseline system. | en_US |
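As a hedged illustration of the back-end described above, here is a minimal Python/PyTorch sketch; it is our reconstruction, not the thesis code, and the names NPLDAScore, soft_dcf, P, Q, and alpha are illustrative assumptions. It shows a PLDA-style quadratic pairwise score with learnable parameters, and a sigmoid relaxation of the detection cost that can serve as a differentiable training objective.

# Sketch (our reconstruction, not the thesis code) of an NPLDA-style
# pairwise score and a differentiable approximation of the detection
# cost function (DCF). All names and default values are illustrative.
import torch
import torch.nn as nn

class NPLDAScore(nn.Module):
    """Quadratic similarity s(e, t) = e'Pt + e'Qe + t'Qt + c, shaped
    like a PLDA log-likelihood ratio but trained discriminatively."""
    def __init__(self, dim):
        super().__init__()
        self.P = nn.Parameter(torch.eye(dim))        # cross term
        self.Q = nn.Parameter(torch.zeros(dim, dim)) # quadratic term
        self.c = nn.Parameter(torch.zeros(1))        # bias

    def forward(self, enroll, test):
        # enroll, test: (batch, dim) embedding pairs -> (batch,) scores
        cross = (enroll @ self.P * test).sum(-1)
        quad = (enroll @ self.Q * enroll).sum(-1) \
             + (test @ self.Q * test).sum(-1)
        return cross + quad + self.c

def soft_dcf(scores, labels, theta=0.0, alpha=10.0,
             c_miss=1.0, c_fa=1.0, p_target=0.01):
    """Sigmoid relaxation of the detection cost at threshold theta;
    labels are 1 for target trials and 0 for non-target trials."""
    tgt = labels.bool()
    p_miss = torch.sigmoid(-alpha * (scores[tgt] - theta)).mean()
    p_fa = torch.sigmoid(alpha * (scores[~tgt] - theta)).mean()
    return c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa

Here the slope alpha controls how closely the smooth miss and false-alarm rates track the hard step-function counts of the actual DCF; in the fully neural E2E-NPLDA setting, the same objective would be backpropagated through the embedding extractor as well.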