Supervised Learning Approaches for Language and Speaker Recognition

Ramoji, Shreyas

Abstract

In the age of artificial intelligence, it is important for machines to automatically determine who is speaking and in what language, a task humans perform naturally. Developing algorithms that automatically infer the speaker, language, or accent from a given speech segment is a challenging problem that has been researched for over three decades. The main aim of this doctoral research was to explore and understand the shortcomings of existing approaches to these problems and to propose novel supervised approaches that overcome them, yielding robust speaker and language recognition systems. In the first part of this doctoral research, we developed a supervised version of a popular embedding extraction approach called the i-vector, typically used as a front-end embedding for speaker and language recognition. In the conventional i-vector approach, a database of speech recordings (each in the form of a sequence of short-term feature vectors) is modeled with a Gaussian Mixture Model called the Universal Background Model (GMM-UBM). The deviation of a recording's mean components from the UBM is captured in a lower-dimensional latent space, called the i-vector space, using a factor analysis framework. We proposed a fully supervised version of the i-vector model, the s-vector, in which each label class is associated with a Gaussian prior with a class-specific mean parameter. The joint prior on the latent variable (marginalized over the sample space of classes) then becomes a GMM. This choice of prior distribution is motivated by the Gaussian back-end, where the conventional i-vectors for each language are modeled with a single Gaussian distribution. With detailed data analysis and visualization, we show that the s-vector features yield representations that succinctly capture the language (accent) label information and do a significantly better job of distinguishing the various accents of the same language.
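The supervised prior described above can be written out as a short sketch. The notation here is assumed for illustration and is not taken from the thesis: M_s is the recording-dependent mean supervector, m the UBM mean supervector, T the total variability matrix, w_s the latent variable, and c the class label. The identity covariance and the class weights \pi_c are likewise assumptions, carried over from the standard-normal prior of the conventional i-vector model.

```latex
% Conventional i-vector model: standard-normal prior on the latent variable
\mathbf{M}_s = \mathbf{m} + \mathbf{T}\,\mathbf{w}_s,
\qquad \mathbf{w}_s \sim \mathcal{N}(\mathbf{0}, \mathbf{I})

% Supervised variant: class-conditional Gaussian prior with a class-specific mean
% (identity covariance is an assumption of this sketch)
p(\mathbf{w}_s \mid c) = \mathcal{N}(\mathbf{w}_s;\, \boldsymbol{\mu}_c, \mathbf{I})

% Marginalizing over the C classes gives a GMM prior on the latent variable
p(\mathbf{w}_s) = \sum_{c=1}^{C} \pi_c \, \mathcal{N}(\mathbf{w}_s;\, \boldsymbol{\mu}_c, \mathbf{I})
```

Under this reading, setting every \boldsymbol{\mu}_c to zero would recover the conventional i-vector prior.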
We performed language recognition experiments on the NIST Language Recognition Evaluation (LRE) 2017 challenge dataset, which has test segments ranging from 3 to 30 seconds. With the s-vector framework, we observed relative improvements over the conventional i-vector framework of 8% to 20% in terms of the Bayesian detection cost function, 4% to 24% in terms of equal error rate (EER), and 9% to 18% in terms of classification accuracy. We also performed language recognition experiments on the RATS and Mozilla CommonVoice datasets, and speaker classification experiments on LibriSpeech, which showed similar improvements.
In the second part, we explored the problem of speaker verification, where a binary decision has to be made on whether a test speech segment is spoken by a target speaker, based on a limited duration of enrollment speech. We proposed a neural network approach for back-end modeling, called neural PLDA (NPLDA). The likelihood ratio score of the generative probabilistic linear discriminant analysis (PLDA) model is posed as a discriminative similarity function, and the learnable parameters of the score function are optimized using a verification cost that approximates the minimum detection cost function (DCF). Speaker recognition experiments with the NPLDA model were performed on the speaker verification task of the VOiCES datasets and on the SITW challenge dataset. Further, we explored a fully neural approach in which the neural model outputs the verification score directly, given the acoustic feature inputs. This Siamese neural network (E2E-NPLDA) model combines the embedding extraction and back-end modeling stages into a single processing pipeline, which allows all the modules to be optimized jointly using a verification cost. We provide a detailed analysis of the influence of hyper-parameters, choice of loss functions, and data sampling strategies for training these models.
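As a rough illustration of the back-end idea described above, the following Python sketch scores an enrollment/test embedding pair with a PLDA-style quadratic function and evaluates a smoothed detection cost. It is a minimal sketch under stated assumptions, not the thesis implementation: the function names, the matrices P and Q, the sigmoid smoothing with sharpness alpha, and the weight beta are all hypothetical choices made for this example.

```python
import math


def quadratic_score(enroll, test, P, Q):
    """PLDA-style pairwise score e'Qe + t'Qt + 2 e'Pt (constant term omitted).

    P and Q are square matrices given as lists of lists; in a discriminative
    back-end they would be the learnable parameters (an assumption here).
    """
    def bilinear(x, M, y):
        return sum(x[i] * M[i][j] * y[j]
                   for i in range(len(x)) for j in range(len(y)))

    return (bilinear(enroll, Q, enroll)
            + bilinear(test, Q, test)
            + 2.0 * bilinear(enroll, P, test))


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_detection_cost(scores, labels, threshold, beta=1.0, alpha=15.0):
    """Smoothed detection cost: soft P_miss + beta * soft P_fa.

    labels: 1 for target trials, 0 for non-target trials.
    The hard accept decision (score > threshold) is replaced by a sigmoid
    with sharpness alpha, which makes the cost differentiable in the scores.
    beta and alpha are illustrative values, not taken from the thesis.
    """
    n_target = sum(1 for y in labels if y == 1)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    # Soft miss rate: targets that fall below the threshold.
    p_miss = sum(1.0 - sigmoid(alpha * (s - threshold))
                 for s, y in zip(scores, labels) if y == 1) / n_target
    # Soft false-alarm rate: non-targets that rise above the threshold.
    p_fa = sum(sigmoid(alpha * (s - threshold))
               for s, y in zip(scores, labels) if y == 0) / n_nontarget
    return p_miss + beta * p_fa
```

In a trainable setting, the scores would come from `quadratic_score` with learnable P and Q, and the smoothed cost would be minimized with a gradient-based optimizer; this sketch only evaluates the cost for fixed scores.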
Several speaker recognition experiments were performed using the Speakers in the Wild (SITW), VOiCES, and NIST SRE datasets, in which the proposed NPLDA and E2E-NPLDA models are shown to improve significantly over the state-of-the-art x-vector PLDA baseline system.

URI

https://etd.iisc.ac.in/handle/2005/6217

Collections

Electrical Engineering (EE) [423]