
    Pitch-Synchronous Discrete Cosine Transform Features for Speaker Recognition and Other Applications

    Thesis full text (987.1Kb)
    Author
    Meghanani, Amit
    Abstract
Extracting speaker-specific information from speech is of great interest, since speaker recognition technology finds application in a wide range of areas such as forensics and biometric security systems. In this thesis, we propose a new feature named pitch-synchronous discrete cosine transform (PS-DCT), derived from the voiced part of speech, for speaker identification (SID) and verification (SV) tasks. Variants of the proposed PS-DCT features are explored for other speech-based applications. PS-DCT features are based on the 'time-domain, quasi-periodic waveform shape' of voiced sounds, which is captured by the discrete cosine transform (DCT). We test our PS-DCT features on the TIMIT, Mandarin, and YOHO datasets for text-independent SID and SV studies. On TIMIT with 168 speakers and Mandarin with 855 speakers, we obtain text-independent SID accuracies of 99.4% and 96.1%, respectively, using a Gaussian mixture model-based classifier. SV studies are performed using the i-vector based framework. To ensure good performance in text-independent speaker verification, sufficient training data for enrollment and longer utterances as test data are required. In i-vector based SV, fusing the PS-DCT based system with the MFCC-based system at the score level, using a convex combination scheme, reduces the equal error rate (EER) for both the YOHO and Mandarin datasets. In the case of limited test data along with session variability, we obtain a significant reduction in EER: up to 5.8% on the YOHO database and 3.2% on the Mandarin dataset for test data of duration < 3 sec. Thus, our experiments demonstrate that the proposed features supplement handcrafted classical features such as MFCC, and the improvement is more prominent in the limited-test-data speaker verification task. We have also explored the efficacy of the proposed features for other speech-based tasks.
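Two ideas in the paragraph above can be sketched in code: computing a DCT over each pitch-synchronous cycle of the voiced waveform, and fusing two systems' scores by convex combination. This is a minimal illustrative sketch, not the thesis implementation; the function names, the fixed resampled cycle length, the number of retained coefficients, and the fusion weight `alpha` are all assumptions, and pitch-mark estimation is taken as given.

```python
import numpy as np

def dct2(x):
    """Plain (unnormalized) DCT-II of a 1-D array."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos(np.pi * (n + 0.5) * k / N))
                     for k in range(N)])

def ps_dct_features(signal, pitch_marks, period_len=64, num_coeffs=20):
    """Hypothetical PS-DCT sketch: one DCT vector per pitch cycle.

    pitch_marks: sample indices of consecutive pitch marks in a voiced
    region (pitch-mark detection is assumed to be done elsewhere).
    """
    feats = []
    for start, end in zip(pitch_marks[:-1], pitch_marks[1:]):
        cycle = np.asarray(signal[start:end], dtype=float)
        # Resample each quasi-periodic cycle to a fixed length so the
        # DCT coefficients are comparable across cycles and speakers.
        grid = np.linspace(0, len(cycle) - 1, period_len)
        resampled = np.interp(grid, np.arange(len(cycle)), cycle)
        feats.append(dct2(resampled)[:num_coeffs])
    return np.array(feats)

def fuse_scores(score_psdct, score_mfcc, alpha=0.3):
    """Score-level fusion by convex combination (alpha is an assumption)."""
    return alpha * score_psdct + (1.0 - alpha) * score_mfcc
```

The fixed-length resampling step is what makes the representation pitch-synchronous: each feature vector describes the shape of one glottal cycle regardless of its duration.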
Emotions influence both the voice characteristics and the linguistic content of speech. Speech emotion recognition (SER) is the task of extracting emotions from the voice characteristics of speech. Since variations in pitch play an important role in expressing emotions through speech, we have tested PS-DCT features for emotion recognition. For the SER experiments, we use the Berlin emotional speech database (EmoDB), which contains 535 utterances spoken by 10 actors (5 female, 5 male) in 7 simulated emotions (anger, boredom, disgust, fear, joy, sadness, and neutral). A bi-directional long short-term memory (BiLSTM) network can model the temporal dependencies of sequential data such as speech and has already been used for emotion recognition; hence, we use a BiLSTM network trained with conventional MFCC features at the front end as the baseline. To provide an accurate assessment of the model, we train and validate it using leave-one-speaker-out (LOSO) k-fold (k = 10 in this case) cross-validation: we train on k-1 speakers, validate on the left-out speaker, and repeat this procedure k times. The final validation accuracy is computed as the average over the k folds. Experiments show that the BiLSTM network trained with combined PS-DCT and PS-MFCC features outperforms a network trained on regular MFCC, with an absolute improvement of 3.5% in 10-fold cross-validation accuracy using the fused (PS-DCT + PS-MFCC) features. Since every vowel class has fixed gross-level temporal dynamics across different speakers and a quasi-stationary structure, we have also explored the usefulness of PS-DCT for vowel recognition. Vowel recognition can be viewed as a subset of the phone recognition task; since PS-DCT features are defined only for voiced sounds, we focus on vowel recognition, as all vowels are inherently voiced.
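The LOSO cross-validation protocol described above can be sketched as follows. This is a schematic, assuming utterances are grouped by speaker id; the `train_fn` and `eval_fn` hooks stand in for the BiLSTM training and evaluation, which are outside its scope.

```python
def loso_folds(speakers):
    """Yield (train_speakers, held_out_speaker), one fold per speaker."""
    speakers = sorted(speakers)
    for held_out in speakers:
        yield [s for s in speakers if s != held_out], held_out

def loso_accuracy(data_by_speaker, train_fn, eval_fn):
    """Average validation accuracy over all LOSO folds.

    data_by_speaker: dict speaker_id -> list of (features, label)
    train_fn / eval_fn: hypothetical hooks for the classifier.
    """
    accs = []
    for train_speakers, held_out in loso_folds(data_by_speaker):
        # Pool all utterances of the k-1 training speakers.
        train_set = [x for s in train_speakers for x in data_by_speaker[s]]
        model = train_fn(train_set)
        # Validate only on the unseen speaker's utterances.
        accs.append(eval_fn(model, data_by_speaker[held_out]))
    return sum(accs) / len(accs)
```

Because each fold's validation speaker never appears in training, the averaged accuracy estimates speaker-independent performance rather than memorization of speaker identity.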
For this study, we use the vowel dataset available at https://homepages.wmich.edu/~hillenbr/voweldata.html. This dataset has 12 vowels (/ae/ as in "had", /ah/ as in "hod", /aw/ as in "hawed", /eh/ as in "head", /er/ as in "heard", /ey/ as in "hayed", /ih/ as in "hid", /iy/ as in "heed", /oa/ as in "hoed", /oo/ as in "hood", /uh/ as in "hud", /uw/ as in "who'd") recorded from 139 subjects covering both genders and a wide range of ages. It has utterances from 45 men, 48 women, and 46 children (both boys and girls): 1668 utterances in total, covering all 12 vowels (words like 'had' and 'hod'). The results show that PS-DCT alone classifies the vowels with a 5-fold cross-validation average accuracy of 73.3%. Using MFCC features at the front end of the BiLSTM network, the 5-fold cross-validation average accuracy is 88.5%, which is much better than that obtained with PS-DCT. Training the network with combined PS-DCT and PS-MFCC did not improve performance, unlike in the emotion recognition case.
    URI
    https://etd.iisc.ac.in/handle/2005/4822
    Collections
    • Electrical Engineering (EE) [357]

    etd@IISc is a joint service of SERC & J R D Tata Memorial (JRDTML) Library || Powered by DSpace software || DuraSpace
    Contact Us | Send Feedback | Thesis Templates
    Theme by Atmire NV