Acoustic-Articulatory Mapping: Analysis and Improvements with Neural Network Learning Paradigms
Abstract
Human speech is one of the many acoustic signals we perceive; it carries both linguistic and paralinguistic (e.g., speaker identity, emotional state) information. Speech acoustics are produced as a result of different temporally overlapping gestures of the speech articulators (such as the lips, tongue tip, tongue body, tongue dorsum, velum, and larynx), each of which regulates constriction in a different part of the vocal tract. Estimating speech acoustic representations from articulatory movements is known as articulatory-to-acoustic forward (AAF) mapping, i.e., articulatory speech synthesis, while estimating articulatory movements back from the speech acoustics is known as acoustic-to-articulatory inverse (AAI) mapping. These acoustic-articulatory mapping functions are known to be complex and nonlinear.
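To fix notation, the two mappings can be written as follows (a minimal formulation for orientation; the frame-level symbols are illustrative and not taken from the thesis):

$$\text{AAF: } \hat{\mathbf{x}}_t = f(\mathbf{a}_t), \qquad \text{AAI: } \hat{\mathbf{a}}_t = g(\mathbf{x}_t),$$

where $\mathbf{x}_t$ denotes an acoustic feature vector (e.g., MFCCs) and $\mathbf{a}_t$ an articulatory feature vector (e.g., EMA sensor positions) at frame $t$.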
The complexity of this mapping depends on a number of factors, including the kind of representations used in the acoustic and articulatory spaces. Typically, these representations capture both linguistic and paralinguistic aspects of speech, and how each of these aspects contributes to the complexity of the mapping is unknown. These representations and, in turn, the acoustic-articulatory mapping are also affected by the speaking rate. The nature and quality of the mapping vary across speakers; thus, the complexity of the mapping also depends on the amount of data from a speaker as well as the number of speakers used in learning the mapping function. Further, how language variation impacts the mapping requires detailed investigation. This thesis analyzes several of these factors in detail and develops neural-network-based models to learn mapping functions that are robust to many of them.
Electromagnetic articulography (EMA) sensor data have been used directly in the past as articulatory representations for learning the acoustic-articulatory mapping function. In this thesis, we address the problem of optimal EMA sensor placement such that the air-tissue boundaries seen in the mid-sagittal plane of real-time magnetic resonance imaging (rtMRI) are reconstructed with minimum error. Following the optimal sensor placement work, acoustic-articulatory data were collected using EMA from 41 subjects with speech stimuli in English and in Indian native languages (Hindi, Kannada, Tamil, and Telugu), resulting in a total of ∼23 hours of data used in this thesis. Representations from the raw waveform are also learned for the AAI task using a convolutional and bidirectional long short-term memory neural network (CNN-BLSTM), where the learned CNN filters are found to be similar to those used for computing Mel-frequency cepstral coefficients (MFCCs), which are typically used for the AAI task. In order to examine the extent to which a representation carrying only linguistic information can recover articulatory representations, we replace MFCC vectors with one-hot encoded vectors representing phonemes, which are further modified to remove the time duration of each phoneme and keep only the phoneme sequence. Experiments with the phoneme sequence using an attention network achieve AAI performance identical to that obtained using phonemes with timing information, while there is a drop in performance compared to that using MFCCs.
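The following is a minimal sketch, not the thesis implementation, of a raw-waveform CNN-BLSTM for AAI; the filter count, kernel size, hop length, and EMA dimensionality are assumed values for illustration only.

```python
# Sketch: CNN-BLSTM mapping raw waveform to articulatory trajectories.
# Assumes 16 kHz audio, a 25 ms analysis window (400 samples), a 10 ms hop
# (160 samples), and 12 articulatory (EMA) output dimensions.
import torch
import torch.nn as nn

class CNNBLSTMAAI(nn.Module):
    def __init__(self, n_ema_dims=12, n_filters=40, kernel_size=400, hop=160):
        super().__init__()
        # 1-D convolution over the raw waveform acts as a learnable filterbank;
        # the thesis reports that the learned filters resemble those used for MFCCs.
        self.filterbank = nn.Conv1d(1, n_filters, kernel_size=kernel_size,
                                    stride=hop, padding=kernel_size // 2)
        self.blstm = nn.LSTM(input_size=n_filters, hidden_size=256, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * 256, n_ema_dims)

    def forward(self, waveform):
        # waveform: (batch, samples)
        feats = torch.relu(self.filterbank(waveform.unsqueeze(1)))  # (batch, filters, frames)
        feats = feats.transpose(1, 2)                               # (batch, frames, filters)
        hidden, _ = self.blstm(feats)
        return self.proj(hidden)                                    # (batch, frames, ema_dims)

# Usage: one second of 16 kHz audio yields roughly 100 articulatory frames.
model = CNNBLSTMAAI()
ema_hat = model(torch.randn(4, 16000))
```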
Experiments examining variation in speaking rate reveal that the errors in estimating the vertical motion of the tongue articulators from acoustics at a fast speaking rate are significantly higher than those at a slow speaking rate. In order to reduce the demand for data from a speaker, low-resource AAI is proposed using a transfer learning approach. Further, we show that the acoustic-articulatory mappings of multiple speakers can be learned with a single AAI model rather than separate speaker-specific models. This is achieved by conditioning the AAI model on speaker embeddings, which benefits AAI in both seen- and unseen-speaker evaluations. Finally, we show the benefit of estimated articulatory representations in voice conversion applications. Experiments reveal that articulatory representations estimated by a speaker-independent AAI preserve linguistic information and suppress speaker-dependent factors. These articulatory representations (from an unseen speaker and language) are used to drive a target-speaker-specific AAF to synthesize speech that preserves the linguistic information and the target speaker's voice characteristics.
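The sketch below illustrates, under assumed dimensions and module names, the two ideas summarized above: conditioning a multi-speaker AAI model on a per-utterance speaker embedding, and chaining a speaker-independent AAI with a target-speaker AAF for voice conversion. It is not the thesis code.

```python
# Sketch: speaker-embedding-conditioned AAI and an AAI -> AAF voice-conversion chain.
# MFCC, embedding, and EMA dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class SpeakerConditionedAAI(nn.Module):
    def __init__(self, n_mfcc=13, spk_dim=128, n_ema_dims=12):
        super().__init__()
        # The speaker embedding is concatenated with the acoustic features at every frame.
        self.blstm = nn.LSTM(n_mfcc + spk_dim, 256, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * 256, n_ema_dims)

    def forward(self, mfcc, spk_emb):
        # mfcc: (batch, frames, n_mfcc); spk_emb: (batch, spk_dim)
        cond = spk_emb.unsqueeze(1).expand(-1, mfcc.size(1), -1)
        hidden, _ = self.blstm(torch.cat([mfcc, cond], dim=-1))
        return self.proj(hidden)

def convert(source_mfcc, source_spk_emb, aai, target_aaf):
    """Hypothetical voice-conversion chain: articulatory trajectories estimated by a
    (speaker-independent or speaker-conditioned) AAI model drive a target-speaker AAF
    synthesizer, here an arbitrary callable passed in by the user."""
    with torch.no_grad():
        articulation = aai(source_mfcc, source_spk_emb)  # carries linguistic content
        return target_aaf(articulation)                  # acoustics in the target voice
```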