
dc.contributor.advisor: Ghosh, Prasanta Kumar
dc.contributor.author: Mannem, Renuka
dc.date.accessioned: 2020-07-07T06:14:30Z
dc.date.available: 2020-07-07T06:14:30Z
dc.date.submitted: 2020
dc.identifier.uri: https://etd.iisc.ac.in/handle/2005/4482
dc.description.abstract: Human speech production involves modulation of the air stream by the vocal tract shape, which is determined by the articulatory configuration. Articulatory gestures are often used to represent speech units, and articulatory representations have been shown to contain information complementary to the acoustics. Thus, a speech task could benefit from representations derived from both acoustic and articulatory data. Typical acoustic representations capture spectral and temporal characteristics, e.g., Mel Frequency Cepstral Coefficients (MFCCs), Line Spectral Frequencies (LSF), and the Discrete Wavelet Transform (DWT). Articulatory representations, on the other hand, vary depending on how the articulatory movements are captured. For example, with Electro-Magnetic Articulography (EMA), the recorded raw movements of the EMA sensors placed on the tongue, jaw, upper lip, and lower lip, together with tract variables derived from them, have often been used as articulatory representations. Similarly, with real-time Magnetic Resonance Imaging (rtMRI), articulatory representations are derived primarily from the Air-Tissue Boundaries (ATBs) in the rtMRI video. The low resolution and low SNR of rtMRI videos make ATB segmentation challenging.

In this thesis, we propose supervised ATB segmentation algorithms, including semantic segmentation and object contour detection using deep convolutional neural networks. The proposed approaches predict ATBs better than the existing baselines, namely the Maeda Grid and Fisher Discriminant Measure based schemes. We also propose a deep fully-connected neural network based ATB correction scheme as a post-processing step to improve the predicted ATBs. However, unlike the speech recording, articulatory data is not directly available in practice. Thus, we also consider articulatory representations derived from acoustics using an Acoustic-to-Articulatory Inversion (AAI) method.

Generic acoustic and articulatory representations may not be optimal for a given speech task. In this thesis, we consider the speech rate (SR) estimation task, which is useful for several speech applications, and propose techniques for deriving acoustic and articulatory representations for it. SR is defined as the number of syllables per second in a given speech recording. We propose a Convolutional Dense Neural Network (CDNN) to estimate the SR from directly given as well as learnt acoustic and articulatory representations (see the illustrative sketch after the abstract). In the case of acoustics, the SR is estimated directly using MFCCs. When the raw speech waveform is given as input, one-dimensional convolutional layers are used to learn SR-specific acoustic representations; the center frequencies of the learned convolutional filters range from 200 to 1000 Hz, unlike the MFCC filter bank frequencies, which span 0 to 4000 Hz. The task-specific features are found to perform better in SR estimation than the MFCCs. The articulatory features also help in accurate speech rate estimation, since the characteristics of articulatory motion vary significantly with changes in the SR. To estimate speech rate-specific articulatory representations, the AAI and CDNN models are jointly trained using a weighted loss function that combines a loss for speech rate estimation with a loss for estimating articulatory representations from acoustics.
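As a rough illustration only, one common form of such a weighted loss is a convex combination of the two objectives; the weight \alpha below is an assumed tuning parameter, not a value taken from the thesis:

    \mathcal{L}_{\mathrm{joint}} = \alpha \, \mathcal{L}_{\mathrm{SR}} + (1 - \alpha) \, \mathcal{L}_{\mathrm{AAI}}, \qquad 0 \le \alpha \le 1,

where \mathcal{L}_{\mathrm{SR}} penalises the error in the predicted speech rate (e.g., a squared error against the reference syllable rate) and \mathcal{L}_{\mathrm{AAI}} penalises the error in the articulatory trajectories estimated from acoustics.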
Similar to the acoustic case, the task-specific articulatory representations derived from acoustics perform better in speech rate estimation than the generic articulatory representations. Even though the task-specific articulatory representations derived from acoustics are not identical to the generic ones, both exhibit low-pass characteristics. The CDNN based approach, using both generic and learnt representations, performs better than the temporal and selected subband correlation (TCSSBC) based baseline scheme for the speech rate estimation task.
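The abstract describes a CDNN that regresses speech rate from acoustic (or articulatory) feature matrices. The following is a minimal sketch of such a network, assuming PyTorch and purely illustrative layer counts and sizes; it is not the architecture used in the thesis.

    # Hedged sketch: a small convolutional-dense network that maps an MFCC matrix
    # (coefficients x frames) to a single speech-rate value (syllables per second).
    # Layer counts and sizes are illustrative assumptions, not the thesis configuration.
    import torch
    import torch.nn as nn

    class SpeechRateCDNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size summary, independent of utterance length
            )
            self.dense = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 4 * 4, 64), nn.ReLU(),
                nn.Linear(64, 1),              # regressed speech rate
            )

        def forward(self, feats):              # feats: (batch, 1, n_coeffs, n_frames)
            return self.dense(self.conv(feats)).squeeze(-1)

    # Example: one utterance, 13 MFCCs x 300 frames -> one speech-rate estimate.
    model = SpeechRateCDNN()
    print(model(torch.randn(1, 1, 13, 300)).shape)  # torch.Size([1])

In the joint-training setting described in the abstract, the input would be the articulatory trajectories predicted by the AAI model rather than MFCCs, and the weighted loss sketched above would be back-propagated through both networks.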
dc.language.iso: en_US
dc.rights: I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
dc.subject: acoustics
dc.subject: articulatory data
dc.subject: task-specific representation learning
dc.subject.classification: Research Subject Categories::TECHNOLOGY::Electrical engineering, electronics and photonics::Other electrical engineering, electronics and photonics
dc.title: Speech task-specific representation learning using acoustic-articulatory data
dc.type: Thesis
dc.degree.name: MTech (Res)
dc.degree.level: Masters
dc.degree.grantor: Indian Institute of Science
dc.degree.discipline: Engineering

