Speech task-specific representation learning using acoustic-articulatory data
Abstract
Human speech production involves modulation of the air stream by the vocal tract shape, which is determined by the articulatory configuration. Articulatory gestures are often used to represent speech units, and it has been shown that articulatory representations contain information complementary to the acoustics. Thus, a speech task could benefit from representations derived from both acoustic and articulatory data. Typical acoustic representations consist of spectral and temporal characteristics, e.g., Mel Frequency Cepstral Coefficients (MFCCs), Line Spectral Frequencies (LSFs), and Discrete Wavelet Transform (DWT) coefficients. Articulatory representations, on the other hand, vary depending on how the articulatory movements are captured. For example, when Electro-Magnetic Articulography (EMA) is used, the recorded raw movements of the EMA sensors placed on the tongue, jaw, upper lip, and lower lip, as well as the tract variables derived from them, are often used as articulatory representations. Similarly, when real-time Magnetic Resonance Imaging (rtMRI) is used, articulatory representations are derived primarily from the Air-Tissue Boundaries (ATBs) in the rtMRI video.

The low resolution and low SNR of rtMRI videos make ATB segmentation challenging. In this thesis, we propose various supervised ATB segmentation algorithms, including semantic segmentation and object contour detection using deep convolutional neural networks. The proposed approaches predict ATBs better than the existing baselines, namely, the Maeda Grid and Fisher Discriminant Measure based schemes. We also propose a deep fully-connected neural network based ATB correction scheme as a post-processing step to further improve the predicted ATBs. However, unlike speech recordings, articulatory data is not directly available in practice. Thus, we also consider articulatory representations derived from acoustics using an Acoustic-to-Articulatory Inversion (AAI) method.

Generic acoustic and articulatory representations may not be optimal for a given speech task. In this thesis, we consider the speech rate (SR) estimation task, which is useful for several speech applications, and propose techniques for deriving task-specific acoustic and articulatory representations for it. SR is defined as the number of syllables per second in a given speech recording. We propose a Convolutional Dense Neural Network (CDNN) to estimate the SR from both directly provided and learnt acoustic and articulatory representations. In the acoustic case, the SR is estimated directly from MFCCs. When the raw speech waveform is given as input, one-dimensional convolutional layers are used to learn SR-specific acoustic representations (a minimal sketch of such a front-end is given below). The center frequencies of the learned convolutional filters range from 200 to 1000 Hz, unlike the MFCC filter bank frequencies, which range from 0 to 4000 Hz. The task-specific features are found to perform better in SR estimation than the MFCCs. The articulatory features also help in accurate SR estimation, since the characteristics of articulatory motion vary significantly with changes in the SR. To learn SR-specific articulatory representations, the AAI and CDNN models are jointly trained using a weighted loss function that combines the loss for SR estimation with the loss for estimating articulatory representations from acoustics.
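As a minimal sketch of the raw-waveform front-end mentioned above, the following PyTorch snippet shows a convolutional dense network that regresses SR from the speech signal; the kernel sizes, strides, channel counts, and the implied 16 kHz sampling rate are illustrative assumptions, not the architecture used in the thesis.

```python
import torch
import torch.nn as nn

class CDNNFromWaveform(nn.Module):
    """Sketch of a convolutional dense network for SR regression from raw audio."""
    def __init__(self, n_filters=40, kernel_size=401):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv1d(1, n_filters, kernel_size, stride=160),  # learnable filter bank over the waveform
            nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, 5, stride=2),      # further temporal context
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),   # pool over time to an utterance-level summary
            nn.Flatten(),
            nn.Linear(n_filters, 64),
            nn.ReLU(),
            nn.Linear(64, 1),          # predicted syllables per second
        )

    def forward(self, waveform):       # waveform: (batch, 1, samples)
        return self.head(self.frontend(waveform))

# Usage with a batch of hypothetical 1-second, 16 kHz waveforms:
sr_hat = CDNNFromWaveform()(torch.randn(8, 1, 16000))
```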
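The joint training objective described above can likewise be sketched as a weighted combination of the two losses; the single weight alpha and the use of mean-squared error for both terms are assumptions for illustration only.

```python
import torch.nn as nn

class JointSRLoss(nn.Module):
    """Sketch of a weighted loss for jointly training the AAI and SR models."""
    def __init__(self, alpha=0.5):
        super().__init__()
        self.alpha = alpha        # assumed trade-off weight between the two terms
        self.mse = nn.MSELoss()

    def forward(self, sr_pred, sr_true, art_pred, art_true):
        loss_sr = self.mse(sr_pred, sr_true)      # speech rate estimation loss
        loss_aai = self.mse(art_pred, art_true)   # acoustic-to-articulatory estimation loss
        return self.alpha * loss_sr + (1.0 - self.alpha) * loss_aai
```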
Similar to the acoustic case, the task-specific articulatory representations derived from acoustics perform better in SR estimation than the generic articulatory representations. Even though the task-specific articulatory representations derived from acoustics are not identical to the generic articulatory representations, both demonstrate low-pass characteristics. For the SR estimation task, the CDNN based approach using both generic and learnt representations performs better than the baseline scheme based on temporal correlation and selected sub-band correlation (TCSSBC).
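The low-pass character of an articulatory (or learnt) trajectory can be checked, for example, by estimating its power spectral density and locating the frequency below which most of the energy lies; the 100 Hz trajectory sampling rate and the 90% energy roll-off criterion in the sketch below are assumptions, not values from the thesis.

```python
import numpy as np
from scipy.signal import welch

def spectral_rolloff(trajectory, fs=100.0, rolloff=0.9):
    """Frequency below which `rolloff` of the trajectory's spectral energy lies."""
    f, pxx = welch(trajectory, fs=fs, nperseg=min(256, len(trajectory)))
    cumulative = np.cumsum(pxx) / np.sum(pxx)
    return f[np.searchsorted(cumulative, rolloff)]

# Example with a synthetic, slowly varying trajectory sampled at 100 Hz.
t = np.arange(0, 5, 0.01)
traj = np.sin(2 * np.pi * 3.0 * t) + 0.1 * np.random.randn(len(t))
print(f"90% spectral roll-off: {spectral_rolloff(traj):.1f} Hz")
```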