Pitch-Synchronous Discrete Cosine Transform Features for Speaker Recognition and Other Applications
Abstract
Extracting speaker-specific information from speech is of great interest, since speaker recognition
technology finds application in a wide range of areas such as forensics and biometric
security systems. In this thesis, we propose a new feature, named pitch-synchronous discrete
cosine transform (PS-DCT), derived from the voiced part of speech, for speaker identification
(SID) and verification (SV) tasks. Variants of the proposed PS-DCT features are also explored
for other speech-based applications.
PS-DCT features are based on the `time-domain, quasi-periodic waveform shape' of voiced
sounds, which is captured by the discrete cosine transform (DCT). We test the PS-DCT
feature on the TIMIT, Mandarin and YOHO datasets for text-independent SID and SV studies.
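To make the idea concrete, the following is a minimal sketch of how such a pitch-synchronous DCT feature could be computed. It assumes that pitch-cycle boundaries are supplied by an external pitch/epoch detector; the cycle-length normalisation, the target cycle length and the number of retained coefficients are illustrative assumptions and may differ from the exact procedure used in the thesis.

    import numpy as np
    from scipy.fft import dct

    def ps_dct_features(signal, pitch_marks, num_coeffs=20, cycle_len=160):
        """Sketch of pitch-synchronous DCT (PS-DCT) extraction.

        signal      : 1-D voiced speech waveform
        pitch_marks : sample indices of consecutive pitch-cycle boundaries,
                      assumed to come from an external pitch/epoch detector
        num_coeffs  : number of DCT coefficients kept per cycle (assumption)
        cycle_len   : each cycle is length-normalised to this many samples
        """
        feats = []
        for start, end in zip(pitch_marks[:-1], pitch_marks[1:]):
            cycle = signal[start:end]
            if len(cycle) < 2:
                continue
            # Length-normalise the cycle so that the waveform *shape*,
            # rather than the pitch period itself, drives the representation.
            x = np.interp(np.linspace(0, len(cycle) - 1, cycle_len),
                          np.arange(len(cycle)), cycle)
            # Amplitude-normalise, then keep the first few DCT-II coefficients,
            # which summarise the gross shape of one quasi-periodic cycle.
            x = x / (np.max(np.abs(x)) + 1e-8)
            feats.append(dct(x, type=2, norm='ortho')[:num_coeffs])
        return np.array(feats)   # shape: (num_cycles, num_coeffs)

Length-normalising each cycle before the DCT makes the representation depend on the shape of the cycle rather than on the pitch period itself.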
On TIMIT with 168 speakers and Mandarin with 855 speakers, we obtain text-independent SID
accuracies of 99.4% and 96.1%, respectively, using a Gaussian mixture model (GMM) based classifier.
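A GMM-based closed-set identification scheme of this kind can be sketched as below: one GMM is fitted per enrolled speaker, and a test utterance is assigned to the speaker whose model gives the highest average frame log-likelihood. The number of mixture components, the covariance type, and whether a universal background model with MAP adaptation is used are assumptions here, not details taken from the thesis.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_speaker_gmms(train_feats, n_components=32):
        """Fit one GMM per speaker on that speaker's feature frames.

        train_feats : dict mapping speaker_id -> (num_frames, feat_dim) array
                      of PS-DCT (or MFCC) vectors
        """
        return {spk: GaussianMixture(n_components=n_components,
                                     covariance_type='diag').fit(X)
                for spk, X in train_feats.items()}

    def identify_speaker(gmms, test_frames):
        """Closed-set SID: return the speaker whose GMM gives the highest
        average log-likelihood over the test utterance's frames."""
        scores = {spk: gmm.score(test_frames) for spk, gmm in gmms.items()}
        return max(scores, key=scores.get)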
SV studies are performed using the i-vector based framework. To ensure good performance in
text-independent speaker verification, sufficient training data for enrollment and longer test
utterances are required. In i-vector based SV, fusing the `PS-DCT based system' with
the `MFCC-based system' at the score level using a convex combination scheme reduces the
equal error rate (EER) on both the YOHO and Mandarin datasets. In the case of limited test data
along with session variability, we obtain a significant reduction in EER: up to 5.8% on the
YOHO database and 3.2% on the Mandarin dataset for test data of duration < 3 sec. Thus, our
experiments demonstrate that the proposed features supplement handcrafted classical features
such as MFCC, and the improvement in performance is more prominent in the limited-test-data
speaker verification task.
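The score-level fusion mentioned above can be sketched as a convex combination of per-trial scores, as below. The fusion weight and the assumption that the two score streams are already on a comparable scale are illustrative choices; in practice the weight would be tuned on a development set to minimise the EER.

    import numpy as np

    def convex_fuse(scores_mfcc, scores_psdct, alpha=0.7):
        """Convex combination of the two systems' per-trial scores
        (e.g. i-vector cosine or PLDA scores), assumed comparably scaled."""
        return alpha * np.asarray(scores_mfcc) + (1 - alpha) * np.asarray(scores_psdct)

    def equal_error_rate(scores, labels):
        """Approximate EER from trial scores and labels (1 = target, 0 = impostor)."""
        order = np.argsort(scores)[::-1]              # sweep threshold downwards
        labels = np.asarray(labels, dtype=float)[order]
        n_tgt, n_imp = labels.sum(), (1 - labels).sum()
        frr = 1.0 - np.cumsum(labels) / n_tgt         # false-rejection rate
        far = np.cumsum(1 - labels) / n_imp           # false-acceptance rate
        idx = np.argmin(np.abs(frr - far))
        return (frr[idx] + far[idx]) / 2.0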
As mentioned earlier, we have also explored the efficacy of the proposed features for other
speech-based tasks. Emotions influence both the voice characteristics and the linguistic content of
speech, and speech emotion recognition (SER) is the task of extracting emotions from the voice
characteristics of speech. Since variations in pitch play an important role in expressing
emotions through speech, we have tested the PS-DCT features for emotion recognition. For the
SER experiments, we have used the Berlin emotional speech database (EmoDB), which contains 535 utterances spoken by 10 actors (5 female, 5 male) in 7 simulated emotions (anger, boredom,
disgust, fear, joy, sadness and neutral). A bi-directional long short-term memory (BiLSTM) network
can model the temporal dependencies of sequential data such as speech and has
already been used for the emotion recognition task. Hence, we use a BiLSTM network
trained with conventional MFCC features at the front end as the baseline. To provide an accurate
assessment of the model, we train and validate it using leave-one-speaker-out (LOSO)
k-fold (k = 10 in this case) cross-validation: the network is trained on k - 1 speakers,
validated on the left-out speaker, and this procedure is repeated k times. The final validation
accuracy is computed as the average over the k folds. Experiments show that
the BiLSTM network trained with the combined PS-DCT and pitch-synchronous MFCC (PS-MFCC) features gives improved
performance over a network trained on regular MFCC. The absolute improvement in 10-fold
cross-validation accuracy is 3.5% using the fused (PS-DCT + PS-MFCC) features.
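The LOSO cross-validation protocol described above can be sketched with speaker identities as the groups, as below; a simple SVM on utterance-level features stands in for the BiLSTM network purely to keep the example self-contained.

    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut
    from sklearn.svm import SVC   # stand-in classifier; the thesis uses a BiLSTM

    def loso_cv_accuracy(X, y, speakers):
        """Leave-one-speaker-out cross-validation.

        X        : (num_utterances, feat_dim) utterance-level features
        y        : (num_utterances,) emotion labels
        speakers : (num_utterances,) speaker identifiers (10 speakers -> 10 folds)
        """
        fold_acc = []
        for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
            clf = SVC()                       # placeholder for the BiLSTM network
            clf.fit(X[train_idx], y[train_idx])
            fold_acc.append(clf.score(X[test_idx], y[test_idx]))
        # The final accuracy is the average over the k (= number of speakers) folds.
        return np.mean(fold_acc)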
Since every vowel class has fixed gross-level temporal dynamics across different
speakers and a quasi-stationary structure, we have also explored the usefulness of PS-DCT for
the task of vowel recognition. Vowel recognition from speech can be viewed as a subset of the
phone recognition task; since PS-DCT features are defined only for voiced sounds and all vowels
are inherently voiced, we focus on vowel recognition alone. For this study, we have used the
vowel dataset available at https://homepages.wmich.edu/~hillenbr/voweldata.html. This dataset
has 12 vowels (/ae/ as in "had", /ah/ as in "hod", /aw/ as in "hawed", /eh/ as in "head",
/er/ as in "heard", /ey/ as in "hayed", /ih/ as in "hid", /iy/ as in "heed", /oa/ as in "hoed",
/oo/ as in "hood", /uh/ as in "hud", /uw/ as in "who'd") recorded from 139 subjects covering
both genders and a wide range of ages: 45 men, 48 women and 46 children (both boys and girls).
In total, it has 1668 utterances covering all 12 vowels (words like "had" and "hod"). The
results show that PS-DCT alone is able to classify the vowels with a 5-fold cross-validation
average accuracy of 73.3%. Using MFCC features at the front end of the BiLSTM network,
the 5-fold cross-validation average accuracy is 88.5%, which is much better than that
obtained with PS-DCT. Unlike in the case of emotion recognition, training the network with the
combined PS-DCT and PS-MFCC features did not lead to improved performance.