dc.description.abstract | The human respiratory system plays a crucial role in breathing and
swallowing. However, it also plays an essential role in speech production,
which is unique to humans. Speech production involves expelling air from
the lungs. As the air flows from the lungs to the lips, some of its kinetic
energy is converted to sound. Different structures modulate the generated
sound, which is finally radiated from the lips. Speech carries various
kinds of information, such as linguistic content, speaker identity,
emotional state, and accent. Apart from speech, there are various
scenarios in which sound is generated in the human respiratory system.
These can be due to abnormalities in the muscles, the motor control unit,
or the lungs, which can directly affect the generated speech as well. A
variety of sounds are also generated by these structures while breathing,
including snoring, stridor, dysphagia, and cough.
The source-filter (SF) model is one of the earlier models of speech
production. It assumes that speech is the result of filtering an
excitation, or source, signal with a linear filter. The source and the
filter are assumed to be independent. Even though the SF model represents
the speech production mechanism, there needs to be a tractable way of
estimating the excitation and the filter. Estimating both of them from
speech falls under the general category of signal deconvolution problems
and, hence, has no unique solution. Several variations of the
source-filter model exist in the literature, each assuming a different
structure for the source or the filter. There are also various ways to
estimate the parameters of the source and the filter. The estimated
parameters are used in various speech applications such as automatic
speech recognition, text-to-speech, and speech enhancement. Even though
the SF model is a model of speech production, it is also used in
applications such as Parkinson’s disease classification and asthma
classification.
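As a rough illustration of the source-filter idea described above, the following Python sketch passes a simple impulse-train excitation through a second-order all-pole filter. It is only a sketch under stated assumptions, not the estimation method of this work: the sampling rate, pitch period, resonance frequency, bandwidth, and all function names are hypothetical values chosen for the example.

```python
import numpy as np

def impulse_train(num_samples, period):
    """Idealized voiced excitation: a unit impulse every `period` samples."""
    e = np.zeros(num_samples)
    e[::period] = 1.0
    return e

def resonator_coeffs(freq_hz, bandwidth_hz, fs):
    """Denominator coefficients [a1, a2] of one two-pole resonance."""
    r = np.exp(-np.pi * bandwidth_hz / fs)   # pole radius (< 1, so stable)
    theta = 2.0 * np.pi * freq_hz / fs       # pole angle
    return np.array([-2.0 * r * np.cos(theta), r * r])

def all_pole_filter(excitation, a):
    """Apply 1 / (1 + a[0] z^-1 + ... + a[p-1] z^-p) to `excitation`."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc -= a[k - 1] * y[n - k]
        y[n] = acc
    return y

# Assumed example values (not taken from the source).
fs = 8000                                    # sampling rate in Hz
excitation = impulse_train(400, period=80)   # 100 Hz pitch at fs = 8000
a = resonator_coeffs(500.0, 100.0, fs)       # single resonance near 500 Hz
speech = all_pole_filter(excitation, a)      # synthetic "speech" signal
```

When the filter is known, inverse filtering recovers the excitation exactly. The deconvolution problem mentioned above arises because, in practice, both the filter and the excitation are unknown.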
Existing source-filter models have shown much success in various
applications; however, we believe that they are lacking in two main
respects. The first limitation is that these models lack a connection to
the physics of sound generation and propagation. The second limitation is
that they are not fully probabilistic. The airflow is inherently
stochastic because of the presence of turbulence. Hence, probabilistic
modeling is necessary to model this stochastic process. Probabilistic
models also come with several other advantages: 1) systematic
incorporation of prior knowledge into the models through probabilistic
priors, 2) estimation of the uncertainty of the model parameters,
3) sampling of new data points, and 4) evaluation of the likelihood of the
observed speech.
We start with the governing equation of sound generation and use a
simplified geometry of the vocal folds. We show that the sound generated
by the vocal folds consists of two parts. The first part is due to the
difference between the subglottal and supraglottal pressures. The second
part is the sound generated by turbulence. The first part is dominant in
voiced sounds, and the second part is dominant in unvoiced sounds. We
further assume plane-wave propagation in the vocal tract and no feedback
from the vocal tract on the vocal folds. The resulting model is an
excitation passing through an all-pole filter, where the excitation is the
sum of two signals. The first signal is quasi-periodic, and the shape of
each cycle depends on the time-varying area of the glottis. The second
signal is stochastic because the turbulence is modeled as white noise
passed through a filter. We further convert the model into a probabilistic
one by assuming the following distributions on the excitations and
filters. We model the excitation using a Bernoulli-Gaussian distribution.
Filter coefficients are modeled using a Gaussian distribution. The noise
distribution is also Gaussian. Given these distributions, the likelihood
of the speech can be derived as a closed-form expression. Similarly, we
impose an appropriate prior on the model’s parameters and perform maximum
a posteriori (MAP) estimation of the parameters. The MAP estimation of the
parameters can be computationally complex. However, the model assumptions
can be changed or approximated to suit the application, resulting in
different estimation procedures. To validate the model, we apply this
model to seven applications, as follows: 1. Analysis and synthesis: This
application assesses the representation power of the model. 2. Robust GCI
detection: This shows the usefulness of the estimated excitation, and the
probabilistic modeling helps to incorporate second-order statistics for
robust excitation estimation. 3. Probabilistic glottal inverse filtering:
This application shows the usefulness of the prior distribution on
filters. 4. Neural speech synthesis: We show that reformulating the model
with a neural network results in computationally efficient neural speech
synthesis. 5. Prosthetic esophageal (PE) to normal speech conversion: We
use the probabilistic model to detect the impulses in the noisy signal in
order to convert PE speech to normal speech. 6. Robust essential vocal
tremor classification: This shows the usefulness of robust excitation
estimation in pathological speech such as essential vocal tremor.
7. Snorer group classification: Based on the analogy between voiced speech
production and snore production, the derived model is applicable to snore
signals. We also use the parameters of the model to classify the snorer
groups. | en_US |