Acoustic-Articulatory Mapping: Analysis and Improvements with Neural Network Learning Paradigms

Illa, Aravind

dc.contributor.advisor	Ghosh, Prasanta Kumar
dc.contributor.author	Illa, Aravind
dc.date.accessioned	2021-11-26T05:10:08Z
dc.date.available	2021-11-26T05:10:08Z
dc.date.submitted	2020
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/5525
dc.description.abstract	Human speech is one of many acoustic signals we perceive, which carries linguistic and paralinguistic (e.g., speaker identity, emotional state) information. Speech acoustics are produced as a result of different temporally overlapping gestures of speech articulators (such as lips, tongue tip, tongue body, tongue dorsum, velum, and larynx), each of which regulates constriction in different parts of the vocal tract. Estimating speech acoustic representations from articulatory movements is known as articulatory- to-acoustic forward (AAF) mapping i.e., articulatory speech synthesis. While estimating articulatory movements back from the speech acoustics is known as acoustic-to-articulatory inverse (AAI) mapping. These acoustic- articulatory mapping functions are known to be complex and nonlinear. The complexity of this mapping depends on a number of factors. These include the kind of representations used in the acoustic and articulatory spaces. Typically these representations capture both linguistic and paralinguistic aspects in speech. How each of these aspects contributes to the complexity of the mapping is unknown. These representations and, in turn, the acoustic-articulatory mapping are affected by the speaking rate as well. The nature and quality of the mapping vary across speakers. Thus, the complexity of mapping also depends on the amount of data from a speaker as well as the number of speakers used in learning the mapping function. Further, how the language variations impact the mapping requires detailed investigation. This thesis analyzes a few of such factors in detail and develops neural-network based models to learn mapping functions robust to many of these factors. Electromagnetic articulography (EMA) sensor data has been used directly in the past as articulatory representations for learning the acoustic-articulatory mapping function. In this thesis, we address the problem of optimal EMA sensor placement such that the air-tissue boundaries as seen in the mid-sagittal plane of the real-time magnetic resonance imaging (rtMRI) are reconstructed with minimum error. Following optimal sensor placement work, acoustic-articulatory data was collected using EMA from 41 subjects with speech stimuli in English and Indian native languages (Hindi, Kannada, Tamil, and Telugu), resulting in a total of ∼23 hours of data, used in this thesis. Representations from raw waveform are also learned for AAI task using convolutional and bidirectional long short term memory neural networks (CNN-BLSTM), where the learned filters of CNN are found to be similar to those used for computing Mel-frequency cepstral coefficients (MFCCs), typically used for AAI task. In order to examine the extent to which a representation having only the linguistic information can recover articulatory representations, we replace MFCC vectors with one-hot encoded vectors representing phonemes, which were further modified to remove the time duration of each phoneme and keep only phoneme sequence. Experiments with phoneme sequence using attention network achieve an AAI performance that is identical to that using phoneme with timing information, while there is a drop in performance compared to that using MFCC. Experiments to examine variation in speaking rate reveal that the errors in estimating the vertical motion of tongue articulators from acoustics with fast speaking rate are significantly higher than those with slow speaking rate. In order to reduce the demand for data from a speaker, low resource AAI is proposed using a transfer learning approach. Further, we show that AAI can be modeled to learn acoustic-articulatory mappings of multiple speakers through a single AAI model rather than building separate speaker-specific models. This is achieved by conditioning an AAI model with speaker embeddings, which benefits AAI in seen and unseen speaker evaluations. Finally, we show the benefit of estimated articulatory representations in voice conversion applications. Experiments revealed that articulatory representations estimated from speaker-independent AAI preserve linguistic information and suppress speaker-dependent factors. These articulatory representations (from an unseen speaker and language) are used to drive target speaker-specific AAF to synthesis speech, which preserves linguistic information and the target speaker’s voice characteristics.	en_US
dc.language.iso	en_US	en_US
dc.rights	I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation	en_US
dc.subject	Acoustic-Articulatory Mapping	en_US
dc.subject	Neural Networks	en_US
dc.subject	Speech Production	en_US
dc.subject	Electromagnetic articulography	en_US
dc.subject.classification	Research Subject Categories::TECHNOLOGY::Electrical engineering, electronics and photonics	en_US
dc.title	Acoustic-Articulatory Mapping: Analysis and Improvements with Neural Network Learning Paradigms	en_US
dc.type	Thesis	en_US
dc.degree.name	PhD	en_US
dc.degree.level	Doctoral	en_US
dc.degree.grantor	Indian Institute of Science	en_US
dc.degree.discipline	Engineering	en_US

Files in this item

Name:: Aravind_merged_thesis.pdf
Size:: 16.55Mb
Format:: PDF
Description:: Thesis full text

View/Open

This item appears in the following Collection(s)

Electrical Engineering (EE) [448]

Show simple item record