Pitch-Synchronous Discrete Cosine Transform Features for Speaker Recognition and Other Applications
Abstract
Extracting speaker-specific information from speech is of great interest, since speaker recognition
technology finds application in a wide range of areas such as forensics and biometric
security systems. In this thesis, we propose a new feature, named pitch-synchronous discrete
cosine transform (PS-DCT), derived from the voiced part of speech, for speaker identification
(SID) and verification (SV) tasks. Variants of the proposed PS-DCT features are also explored
for other speech-based applications.
PS-DCT features are based on the `time-domain, quasi-periodic waveform shape' of voiced
sounds, which is captured by the discrete cosine transform (DCT). We test the PS-DCT
feature on the TIMIT, Mandarin and YOHO datasets for text-independent SID and SV studies.
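To make the idea concrete, the following is a minimal sketch of how such a pitch-synchronous DCT feature could be computed. It assumes that pitch-cycle boundaries are supplied by an external pitch/epoch detector; the cycle-length normalisation, the target cycle length and the number of retained coefficients are illustrative assumptions and may differ from the exact procedure used in the thesis.

    import numpy as np
    from scipy.fft import dct

    def ps_dct_features(signal, pitch_marks, num_coeffs=20, cycle_len=160):
        """Sketch of pitch-synchronous DCT (PS-DCT) extraction.

        signal      : 1-D voiced speech waveform
        pitch_marks : sample indices of consecutive pitch-cycle boundaries,
                      assumed to come from an external pitch/epoch detector
        num_coeffs  : number of DCT coefficients kept per cycle (assumption)
        cycle_len   : each cycle is length-normalised to this many samples
        """
        feats = []
        for start, end in zip(pitch_marks[:-1], pitch_marks[1:]):
            cycle = signal[start:end]
            if len(cycle) < 2:
                continue
            # Length-normalise the cycle so that the waveform *shape*,
            # rather than the pitch period itself, drives the representation.
            x = np.interp(np.linspace(0, len(cycle) - 1, cycle_len),
                          np.arange(len(cycle)), cycle)
            # Amplitude-normalise, then keep the first few DCT-II coefficients,
            # which summarise the gross shape of one quasi-periodic cycle.
            x = x / (np.max(np.abs(x)) + 1e-8)
            feats.append(dct(x, type=2, norm='ortho')[:num_coeffs])
        return np.array(feats)   # shape: (num_cycles, num_coeffs)

Length-normalising each cycle before the DCT makes the representation depend on the shape of the cycle rather than on the pitch period itself.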
On TIMIT with 168 speakers and Mandarin with 855 speakers, we obtain text-independent SID
accuracies of 99.4% and 96.1%, respectively, using a Gaussian mixture model (GMM) based classifier.
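A GMM-based closed-set identification scheme of this kind can be sketched as below: one GMM is fitted per enrolled speaker, and a test utterance is assigned to the speaker whose model gives the highest average frame log-likelihood. The number of mixture components, the covariance type, and whether a universal background model with MAP adaptation is used are assumptions here, not details taken from the thesis.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_speaker_gmms(train_feats, n_components=32):
        """Fit one GMM per speaker on that speaker's feature frames.

        train_feats : dict mapping speaker_id -> (num_frames, feat_dim) array
                      of PS-DCT (or MFCC) vectors
        """
        return {spk: GaussianMixture(n_components=n_components,
                                     covariance_type='diag').fit(X)
                for spk, X in train_feats.items()}

    def identify_speaker(gmms, test_frames):
        """Closed-set SID: return the speaker whose GMM gives the highest
        average log-likelihood over the test utterance's frames."""
        scores = {spk: gmm.score(test_frames) for spk, gmm in gmms.items()}
        return max(scores, key=scores.get)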
SV studies are performed using the i-vector based framework. To ensure good performance in
text-independent speaker verification, sufficient training data for enrollment and longer test
utterances are required. In i-vector based SV, fusing the `PS-DCT based system' with
the `MFCC-based system' at the score level using a convex combination scheme reduces the
equal error rate (EER) on both the YOHO and Mandarin datasets. In the case of limited test data
along with session variability, we obtain a significant reduction in EER: up to 5.8% on the
YOHO database and 3.2% on the Mandarin dataset for test data of duration < 3 sec. Thus, our
experiments demonstrate that the proposed features supplement handcrafted classical features
such as MFCC, and the improvement in performance is more prominent in the limited-test-data
speaker verification task.
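The score-level fusion mentioned above can be sketched as a convex combination of per-trial scores, as below. The fusion weight and the assumption that the two score streams are already on a comparable scale are illustrative choices; in practice the weight would be tuned on a development set to minimise the EER.

    import numpy as np

    def convex_fuse(scores_mfcc, scores_psdct, alpha=0.7):
        """Convex combination of the two systems' per-trial scores
        (e.g. i-vector cosine or PLDA scores), assumed comparably scaled."""
        return alpha * np.asarray(scores_mfcc) + (1 - alpha) * np.asarray(scores_psdct)

    def equal_error_rate(scores, labels):
        """Approximate EER from trial scores and labels (1 = target, 0 = impostor)."""
        order = np.argsort(scores)[::-1]              # sweep threshold downwards
        labels = np.asarray(labels, dtype=float)[order]
        n_tgt, n_imp = labels.sum(), (1 - labels).sum()
        frr = 1.0 - np.cumsum(labels) / n_tgt         # false-rejection rate
        far = np.cumsum(1 - labels) / n_imp           # false-acceptance rate
        idx = np.argmin(np.abs(frr - far))
        return (frr[idx] + far[idx]) / 2.0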
As mentioned earlier, we have also explored the efficacy of the proposed features for other
speech-based tasks. Emotions influence both the voice characteristics and the linguistic content of
speech, and speech emotion recognition (SER) is the task of extracting emotions from the voice
characteristics of speech. Since variations in pitch play an important role in expressing
emotions through speech, we have tested the PS-DCT features for emotion recognition. For the
SER experiments, we have used the Berlin emotional speech database (EmoDB), which contains 535 utterances spoken by 10 actors (5 female, 5 male) in 7 simulated emotions (anger, boredom,
disgust, fear, joy, sadness and neutral). A bi-directional long short-term memory (BiLSTM) network
can model the temporal dependencies of sequential data such as speech and has
already been used for the emotion recognition task. Hence, we use a BiLSTM network
trained with conventional MFCC features at the front end as the baseline. To provide an accurate
assessment of the model, we train and validate it using leave-one-speaker-out (LOSO)
k-fold (k = 10 in this case) cross-validation: the network is trained on k - 1 speakers,
validated on the left-out speaker, and this procedure is repeated k times. The final validation
accuracy is computed as the average over the k folds. Experiments show that
the BiLSTM network trained with the combined PS-DCT and pitch-synchronous MFCC (PS-MFCC) features gives improved
performance over a network trained on regular MFCC. The absolute improvement in 10-fold
cross-validation accuracy is 3.5% using the fused (PS-DCT + PS-MFCC) features.
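The LOSO cross-validation protocol described above can be sketched with speaker identities as the groups, as below; a simple SVM on utterance-level features stands in for the BiLSTM network purely to keep the example self-contained.

    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut
    from sklearn.svm import SVC   # stand-in classifier; the thesis uses a BiLSTM

    def loso_cv_accuracy(X, y, speakers):
        """Leave-one-speaker-out cross-validation.

        X        : (num_utterances, feat_dim) utterance-level features
        y        : (num_utterances,) emotion labels
        speakers : (num_utterances,) speaker identifiers (10 speakers -> 10 folds)
        """
        fold_acc = []
        for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
            clf = SVC()                       # placeholder for the BiLSTM network
            clf.fit(X[train_idx], y[train_idx])
            fold_acc.append(clf.score(X[test_idx], y[test_idx]))
        # The final accuracy is the average over the k (= number of speakers) folds.
        return np.mean(fold_acc)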
Since every vowel class has fixed gross-level temporal dynamics across different
speakers and a quasi-stationary structure, we have also explored the usefulness of PS-DCT for
the task of vowel recognition. Vowel recognition from speech can be viewed as a subset of the
phone recognition task; since PS-DCT features are defined only for voiced sounds and all vowels
are inherently voiced, we focus on vowel recognition alone. For this study, we have used the
vowel dataset available at https://homepages.wmich.edu/~hillenbr/voweldata.html. This dataset
has 12 vowels (/ae/ as in "had", /ah/ as in "hod", /aw/ as in "hawed", /eh/ as in "head",
/er/ as in "heard", /ey/ as in "hayed", /ih/ as in "hid", /iy/ as in "heed", /oa/ as in "hoed",
/oo/ as in "hood", /uh/ as in "hud", /uw/ as in "who'd") recorded from 139 subjects covering
both genders and a wide range of ages: 45 men, 48 women and 46 children (both boys and girls).
In total, it has 1668 utterances covering all 12 vowels (words like "had" and "hod"). The
results show that PS-DCT alone is able to classify the vowels with a 5-fold cross-validation
average accuracy of 73.3%. Using MFCC features at the front end of the BiLSTM network,
the 5-fold cross-validation average accuracy is 88.5%, which is much better than that
obtained with PS-DCT. Unlike in the case of emotion recognition, training the network with the
combined PS-DCT and PS-MFCC features did not lead to improved performance.