dc.contributor.advisorGhosh, Prasanta Kumar
dc.contributor.authorAchuth Rao, M V
dc.date.accessioned2021-12-16T05:03:58Z
dc.date.available2021-12-16T05:03:58Z
dc.date.submitted2020
dc.identifier.urihttps://etd.iisc.ac.in/handle/2005/5555
dc.description.abstractThe human respiratory system plays a crucial role in breathing and swallowing. It also plays an essential role in speech production, which is unique to humans. Speech production involves expelling air from the lungs. As the air flows from the lungs to the lips, some of its kinetic energy is converted to sound. Different structures modulate the generated sound, which is finally radiated from the lips. Speech carries various kinds of information, such as linguistic content, speaker identity, emotional state, and accent. Apart from speech, there are various scenarios in which sound is generated in the human respiratory system. These could be due to abnormalities in the muscles, the motor control unit, or the lungs, which can directly affect the generated speech as well. These structures also generate a variety of sounds during breathing, including snoring, stridor, dysphagia, and cough. The source-filter (SF) model is one of the earlier models of speech production. It assumes that speech results from filtering an excitation, or source, signal with a linear filter. The source and the filter are assumed to be independent. Even though the SF model represents the speech production mechanism, a tractable way of estimating the excitation and the filter is still needed. Estimating both from speech alone is a signal deconvolution problem and hence has no unique solution. The literature contains several variations of the source-filter model, which assume different structures for the source and the filter, and various ways of estimating their parameters. The estimated parameters are used in speech applications such as automatic speech recognition, text-to-speech, and speech enhancement. Although the SF model is a model of speech production, it is also used in applications such as Parkinson’s disease classification and asthma classification.
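The non-uniqueness of the deconvolution described above can be illustrated with a short worked equation. The symbols s[n] (speech), e[n] (excitation), h[n] (filter impulse response), and G(z) (an arbitrary invertible filter) are notation introduced here for illustration and are not taken from the thesis.

```latex
s[n] = \sum_{k} h[k]\, e[n-k]
\quad\Longleftrightarrow\quad
S(z) = H(z)\,E(z)
     = \bigl[H(z)\,G(z)\bigr]\,\bigl[E(z)\,G(z)^{-1}\bigr]
```

Since the pair (H G, E / G) reproduces the same S(z) for any invertible G(z), the speech signal alone cannot identify the source and the filter; additional structure or priors on one or both are needed.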
Existing source-filter models have been successful in various applications; however, we believe that they fall short in two respects. First, they lack a connection to the physics of sound generation and propagation. Second, they are not fully probabilistic. Airflow is inherently stochastic because of turbulence, so probabilistic modeling is necessary. Probabilistic models also bring several other advantages: 1) prior knowledge can be incorporated systematically through probabilistic priors, 2) the uncertainty of the model parameters can be estimated, 3) new data points can be sampled, and 4) the likelihood of the observed speech can be evaluated. We start with the governing equation of sound generation and use a simplified geometry of the vocal folds. We show that the sound generated by the vocal folds consists of two parts. The first part is due to the difference between the subglottal and supraglottal pressures. The second part is due to turbulence. The first part is dominant in voiced sounds, and the second in unvoiced sounds. We further assume plane wave propagation in the vocal tract and no feedback from the vocal tract on the vocal folds. The resulting model is an excitation passing through an all-pole filter, where the excitation is the sum of two signals. The first signal is quasi-periodic, and the shape of each cycle depends on the time-varying area of the glottis. The second signal is stochastic because the turbulence is modeled as white noise passed through a filter. We then convert the model into a probabilistic one by assuming the following distributions on the excitation and the filter. We model the excitation using a Bernoulli-Gaussian distribution. The filter coefficients are modeled using a Gaussian distribution. The noise distribution is also Gaussian. Given these distributions, the likelihood of the speech can be derived as a closed-form expression. Similarly, we impose appropriate priors on the model’s parameters and perform maximum a posteriori (MAP) estimation of the parameters. MAP estimation can be computationally complex, but the model assumptions can be changed or approximated to suit the application, resulting in different estimation procedures. To validate the model, we apply it to seven applications:
1. Analysis and synthesis: This application examines the representation power of the model.
2. Robust GCI detection: This shows the usefulness of the estimated excitation; the probabilistic modeling helps incorporate second-order statistics for robust excitation estimation.
3. Probabilistic glottal inverse filtering: This application shows the usefulness of the prior distribution on the filters.
4. Neural speech synthesis: We show that reformulating the model with a neural network results in computationally efficient neural speech synthesis.
5. Prosthetic esophageal (PE) to normal speech conversion: We use the probabilistic model to detect the impulses in the noisy signal in order to convert PE speech to normal speech.
6. Robust essential vocal tremor classification: This shows the usefulness of robust excitation estimation in pathological speech such as essential vocal tremor.
7. Snorer group classification: Based on the analogy between voiced speech production and snore production, the derived model is applicable to snore signals. We also use the parameters of the model to classify the snorer groups.en_US
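The generative side of the model summarized in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names, the sparsity probability `p`, the variances, and the pole locations are all hypothetical, and the filter is built from fixed stable poles rather than drawn from a Gaussian prior so that the recursion stays bounded.

```python
import numpy as np

def sample_bernoulli_gaussian(n, p, sigma, rng):
    """Draw n samples that are nonzero with probability p; nonzero values are N(0, sigma^2)."""
    active = rng.random(n) < p
    return active * rng.normal(0.0, sigma, size=n)

def all_pole_filter(excitation, a):
    """Apply s[t] = e[t] - sum_{k>=1} a[k] * s[t-k], assuming a[0] == 1."""
    order = len(a) - 1
    s = np.zeros(len(excitation))
    for t in range(len(excitation)):
        acc = excitation[t]
        for k in range(1, order + 1):
            if t - k >= 0:
                acc -= a[k] * s[t - k]
        s[t] = acc
    return s

def synthesize(n=400, p=0.02, sigma_e=1.0, sigma_noise=0.01, seed=0):
    """Sample a sparse excitation, filter it with an all-pole filter, add Gaussian noise."""
    rng = np.random.default_rng(seed)
    # Hypothetical filter: two resonances given as pole radius and angle
    # (radians per sample). Radii below 1 keep the filter stable.
    poles = [0.97 * np.exp(1j * 0.3), 0.95 * np.exp(1j * 1.2)]
    poles = poles + [np.conj(z) for z in poles]
    a = np.real(np.poly(poles))  # denominator coefficients, a[0] == 1
    e = sample_bernoulli_gaussian(n, p, sigma_e, rng)
    s = all_pole_filter(e, a)
    y = s + rng.normal(0.0, sigma_noise, size=n)
    return e, a, y

if __name__ == "__main__":
    e, a, y = synthesize()
    print("nonzero excitation samples:", int(np.count_nonzero(e)))
    print("filter coefficients:", np.round(a, 3))
```

Because every component is Gaussian once the positions of the nonzero excitation samples are fixed, the observed signal is conditionally Gaussian, which is consistent with the closed-form likelihood mentioned in the abstract. The sketch only samples from the model; it does not perform the MAP estimation.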
dc.language.isoen_USen_US
dc.rightsI grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertationen_US
dc.subjectSpeech productionen_US
dc.subjectSounden_US
dc.subjectrobust excitationen_US
dc.subjectProsthetic esophagealen_US
dc.subject.classificationResearch Subject Categories::TECHNOLOGY::Electrical engineering, electronics and photonics::Electrical engineeringen_US
dc.titleProbabilistic source-filter model of speechen_US
dc.typeThesisen_US
dc.degree.namePhDen_US
dc.degree.levelDoctoralen_US
dc.degree.grantorIndian Institute of Scienceen_US
dc.degree.disciplineEngineeringen_US

