Speaker verification using whispered speech
Abstract
Like neutral speech, whispered speech is one of the natural modes of speech production, and it is often used by speakers in their day-to-day lives. For some people, such as laryngectomees, whispered speech is the only mode of communication. Despite the absence of voicing in whispered speech and its difference in characteristics from neutral speech, previous works in the literature have demonstrated that whispered speech contains adequate information about both the content and the speaker.
In recent times, virtual assistants have become more natural and widespread. This has led to an increase in scenarios where a device has to detect speech and verify the speaker even when the speaker whispers. Due to its noise-like characteristics, detecting whispered speech is a challenge. On the other hand, a typical speaker verification system, in which neutral speech is used for enrolling the speakers but whispered speech is used for testing, often performs poorly due to the difference in acoustic characteristics between whispered and neutral speech. Hence, the aim of this thesis is two-fold: 1) to develop a robust whisper activity detector specifically for the speaker verification task, and 2) to improve whispered speech based speaker verification performance.
The contributions of this thesis lie in whisper activity detection as well as whispered speech based speaker verification. It is shown how attention-based average pooling in a speaker verification model can be used to detect whispered speech regions in noisy audio more accurately than the best of the available baseline schemes.
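As an illustration only, the following is a minimal sketch of attention-based average pooling over frame-level features, written in PyTorch. The layer sizes, class name, and the simple thresholding of attention weights against a uniform weight to flag likely whispered-speech frames are assumptions made for this sketch, not the exact architecture or detection rule used in the thesis.

```python
# Minimal sketch of attention-based average pooling (assumed PyTorch setup).
import torch
import torch.nn as nn


class AttentiveAveragePooling(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Small scoring network producing one attention score per frame.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, feat_dim)
        weights = torch.softmax(self.score(frames).squeeze(-1), dim=1)  # (batch, time)
        pooled = torch.sum(weights.unsqueeze(-1) * frames, dim=1)       # (batch, feat_dim)
        return pooled, weights


# Usage: the pooled embedding feeds the speaker verification head, while the
# per-frame weights can be inspected for whisper activity detection.
pool = AttentiveAveragePooling(feat_dim=40)
x = torch.randn(2, 300, 40)               # e.g. 300 frames of 40-dim features
embedding, attn = pool(x)
whisper_mask = attn > (1.0 / x.shape[1])  # frames weighted above the uniform average
```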
For improving speaker verification using whispered speech, we proposed features based on formant gaps, and we showed that these features are more invariant to the mode of speech than the best of the existing features.
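As a rough illustration of the formant-gap idea, the sketch below estimates formant frequencies for one frame via LPC root-finding and keeps the gaps between adjacent formant estimates (F2-F1, F3-F2, ...). The LPC order, frame length, helper name formant_gaps, and the synthetic test frame are assumptions for this sketch, not the feature extraction pipeline used in the thesis.

```python
# Rough sketch of formant-gap features from LPC roots (assumed librosa/numpy setup).
import numpy as np
import librosa


def formant_gaps(frame: np.ndarray, sr: int, lpc_order: int = 12, n_formants: int = 4):
    """Return gaps between adjacent formant estimates for one speech frame."""
    a = librosa.lpc(frame.astype(float), order=lpc_order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # keep one root of each conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
    freqs = freqs[freqs > 90][:n_formants]       # discard near-DC roots
    # A real extractor would also filter roots by bandwidth; kept simple here.
    return np.diff(freqs)                        # F2-F1, F3-F2, ...


# Usage on a synthetic 25 ms frame at 16 kHz with formant-like peaks.
sr = 16000
t = np.arange(400) / sr
frame = sum(np.sin(2 * np.pi * f * t) for f in (500, 1500, 2500)) * np.hamming(400)
print(formant_gaps(frame, sr))                   # gaps near 1000 Hz each
```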
We also proposed two feature mapping methods to convert whispered features to neutral features for speaker verification. In the first method, we introduced a novel objective function, based on cosine similarity, for training a DNN used for feature mapping. In the second method, we iteratively optimized the feature mapping model using the cosine similarity based objective function and the total variability space likelihood in the i-vector based background model.
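As an illustration of the first method's training objective, the following PyTorch sketch trains a hypothetical feed-forward mapper with a loss of one minus the cosine similarity between the mapped whispered feature and its paired neutral feature. The network shape, optimizer settings, and variable names are assumptions for this sketch and not the exact configuration used in the thesis.

```python
# Minimal sketch of a cosine-similarity objective for whisper-to-neutral
# feature mapping (assumed PyTorch setup).
import torch
import torch.nn as nn

feat_dim = 40
mapper = nn.Sequential(                    # whispered feature in, neutral-like feature out
    nn.Linear(feat_dim, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, feat_dim),
)
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-3)


def cosine_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Maximise cosine similarity between mapped and neutral features,
    # i.e. minimise 1 - cos(pred, target), averaged over the batch.
    return (1.0 - nn.functional.cosine_similarity(pred, target, dim=-1)).mean()


# One training step on a batch of paired (whispered, neutral) features
# (random tensors stand in for real paired data).
whispered = torch.randn(32, feat_dim)
neutral = torch.randn(32, feat_dim)
loss = cosine_loss(mapper(whispered), neutral)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```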
The proposed optimization provided a more reliable mapping from whispered features to neutral features, resulting in a relative improvement of 44.8% in speaker verification equal error rate over an existing DNN based feature mapping scheme.