Recent advances in speaker recognition and identification research have made speaker identification one of the most trusted methods for authorization and forensic applications. However, field deployment of such systems requires that they function effectively in noisy environments. Designing robust speaker identification systems for such conditions has attracted significant attention from the research community and is the focus of this thesis.
In this work, we explore various dimensionality reduction techniques and their application to speaker identification. Principal Component Analysis (PCA), a coordinate-based dimensionality reduction method, plays a dominant role in this domain. By projecting the original feature set onto a smaller subspace through a linear orthogonal transformation, PCA reduces both the dimensionality of and the correlation among the feature vectors. This transformation lowers computational overhead in subsequent processing stages and attenuates the effect of noise, thereby improving accuracy.
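The PCA projection described above can be sketched as follows. This is a minimal illustrative implementation, not the thesis code; the function name `pca_transform` and the use of NumPy's eigendecomposition are assumptions for the example.

```python
import numpy as np

def pca_transform(features, n_components):
    """Project feature vectors onto the top principal components.

    features: (n_frames, n_dims) matrix, e.g. MFCC vectors per frame.
    Returns the projected features and the projection matrix.
    """
    # Center the data and estimate the covariance matrix.
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)

    # Eigendecomposition; order eigenvectors by descending eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    projection = eigvecs[:, order[:n_components]]

    # The linear orthogonal transformation: the projected coefficients
    # are decorrelated and of lower dimension than the originals.
    return centered @ projection, projection
```

Because the projection basis is the eigenbasis of the covariance matrix, the covariance of the transformed features is diagonal, which is the decorrelation property the abstract refers to.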
This thesis applies a feature-dependent dimensionality reduction technique known as Weighted Principal Component Analysis (WPCA). The key advantage of WPCA is its ability to merge coordinate-based and weight-based methods into a unified framework. Experimental results show an improvement of up to 3% in speaker identification accuracy using WPCA over PCA across various Signal-to-Noise Ratios (SNRs).
Selecting the optimal set of parameters is critical in dimensionality reduction. In speaker identification, the most significant components are those that best discriminate speech among individual speakers. We conducted experiments to identify the optimal parameters in the transformed feature vectors. In this thesis, 24-dimensional MFCC (Mel-Frequency Cepstral Coefficients) feature vectors are used and transformed into either PCA or WPCA space. In PCA space, each feature vector is divided into two parts: coefficients 1 to 12, corresponding to the higher eigenvalues, form the Principal Component Features (PCF), and coefficients 13 to 24, corresponding to the lower eigenvalues, form the Minor Component Features (MCF). Experimental and analytical results show that MCFs have greater discriminative power than PCFs.
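The PCF/MCF split described above amounts to slicing the eigenvalue-ordered coefficients. A minimal sketch, assuming the transformed vectors are already ordered by descending eigenvalue (the helper name `split_pcf_mcf` is illustrative):

```python
import numpy as np

def split_pcf_mcf(transformed, n_pcf=12):
    """Split PCA/WPCA-transformed vectors into principal and minor parts.

    Assumes coefficients are ordered by descending eigenvalue, so the
    first n_pcf coefficients form the Principal Component Features (PCF)
    and the remainder form the Minor Component Features (MCF).
    """
    pcf = transformed[:, :n_pcf]   # coefficients 1..12: higher eigenvalues
    mcf = transformed[:, n_pcf:]   # coefficients 13..24: lower eigenvalues
    return pcf, mcf
```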
Another significant contribution of this thesis is the extraction of latent features from the speech spectrum to enable automatic noise filtration. The proposed method applies Latent Variable Decomposition (LVD) to the magnitude spectral vectors of the speech signal. In this method, the distribution of spectral vectors is modeled as a mixture of multinomial distributions, defined by the a priori probabilities of a fixed number of hidden classes and the conditional frequency-bin distributions. These distributions form the transformation matrix used to generate the new feature vectors, whose dimensionality equals the number of hidden classes. Since these features are inherently frequency-independent, noise effects are absorbed during this process.
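A decomposition of this kind can be sketched with an EM procedure in the style of probabilistic latent semantic analysis: each spectral frame is modeled as a mixture of multinomial distributions over frequency bins, P(f | t) ≈ Σ_z P(z | t) P(f | z), and the per-frame mixture weights P(z | t) become the new features. This is a minimal sketch under those assumptions, not the thesis implementation; the function name `lvd_features`, the initialisation, and the iteration count are all illustrative.

```python
import numpy as np

def lvd_features(spectrogram, n_classes, n_iter=50, seed=0):
    """Sketch of latent variable decomposition of a magnitude spectrogram.

    spectrogram: (n_freq, n_frames) non-negative magnitude values.
    Returns (n_frames, n_classes) latent features: P(z | t) per frame.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = spectrogram.shape
    # Random initialisation of P(f|z) and P(z|t), each column-normalised.
    p_f_z = rng.random((n_freq, n_classes))
    p_f_z /= p_f_z.sum(axis=0, keepdims=True)
    p_z_t = rng.random((n_classes, n_frames))
    p_z_t /= p_z_t.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior P(z | f, t) for every bin/frame pair.
        joint = p_f_z[:, :, None] * p_z_t[None, :, :]      # (f, z, t)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: reweight the posteriors by the observed magnitudes.
        weighted = spectrogram[:, None, :] * post          # (f, z, t)
        p_f_z = weighted.sum(axis=2)
        p_f_z /= p_f_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_t = weighted.sum(axis=0)
        p_z_t /= p_z_t.sum(axis=0, keepdims=True) + 1e-12

    return p_z_t.T  # one n_classes-dimensional feature vector per frame
```

The number of hidden classes `n_classes` directly sets the dimensionality of the resulting feature vectors, matching the description above.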
These features are used in a candidate selection stage, where decisions are made based on the Bhattacharyya distance between speakers. Gaussian Mixture Models (GMMs) are then applied to the selected candidates using MFCC feature vectors. Results show that the proposed features yield up to a 400% improvement in speaker identification rate over MFCC features at 10 dB SNR, demonstrating high effectiveness in noisy environments.
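If each speaker's latent features are summarised by a Gaussian, the distance used in candidate selection can be computed with the standard closed form for the Bhattacharyya distance between two multivariate Gaussians. A minimal sketch (the function name and the Gaussian summarisation are assumptions for the example, not the thesis specification):

```python
import numpy as np

def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians.

    A small distance means the two speaker models overlap heavily;
    candidate selection keeps only the closest models for GMM scoring.
    """
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    # Mean term: Mahalanobis-like distance under the averaged covariance.
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Covariance term: use slogdet for numerical stability.
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term_cov = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term_mean + term_cov
```

The distance is zero for identical distributions and grows with both mean separation and covariance mismatch, which is what makes it suitable for pruning unlikely speakers before the more expensive GMM evaluation.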