
dc.contributor.advisor: Ghosh, Prasanta Kumar
dc.contributor.author: Mannem, Renuka
dc.date.accessioned: 2020-07-07T06:14:30Z
dc.date.available: 2020-07-07T06:14:30Z
dc.date.submitted: 2020
dc.identifier.uri: https://etd.iisc.ac.in/handle/2005/4482
dc.description.abstract: Human speech production involves modulation of the air stream by the vocal tract shape, which is determined by the articulatory configuration. Articulatory gestures are often used to represent speech units, and articulatory representations have been shown to contain information complementary to the acoustics. Thus, a speech task could benefit from representations derived from both acoustic and articulatory data. Typical acoustic representations capture spectral and temporal characteristics, e.g., Mel Frequency Cepstral Coefficients (MFCCs), Line Spectral Frequencies (LSF), and the Discrete Wavelet Transform (DWT). Articulatory representations, on the other hand, vary depending on how the articulatory movements are captured. For example, with Electro-Magnetic Articulography (EMA), the recorded raw movements of the EMA sensors placed on the tongue, jaw, upper lip, and lower lip, together with tract variables derived from them, have often been used as articulatory representations. Similarly, with real-time Magnetic Resonance Imaging (rtMRI), articulatory representations are derived primarily from the Air-Tissue Boundaries (ATBs) in the rtMRI video. The low resolution and low SNR of rtMRI videos make ATB segmentation challenging.

In this thesis, we propose supervised ATB segmentation algorithms, including semantic segmentation and object contour detection using deep convolutional neural networks. The proposed approaches predict ATBs better than the existing baselines, namely the Maeda Grid and Fisher Discriminant Measure based schemes. We also propose a deep fully-connected neural network based ATB correction scheme as a post-processing step to improve the predicted ATBs. However, unlike the speech recording, articulatory data is not directly available in practice. Thus, we also consider articulatory representations derived from acoustics using an Acoustic-to-Articulatory Inversion (AAI) method.

Generic acoustic and articulatory representations may not be optimal for a given speech task. In this thesis, we consider the speech rate (SR) estimation task, which is useful for several speech applications, and propose techniques for deriving acoustic and articulatory representations for it. SR is defined as the number of syllables per second in a given speech recording. We propose a Convolutional Dense Neural Network (CDNN) to estimate the SR from directly given as well as learnt acoustic and articulatory representations (see the illustrative sketch after the abstract). In the case of acoustics, the SR is estimated directly using MFCCs. When the raw speech waveform is given as input, one-dimensional convolutional layers are used to learn SR-specific acoustic representations; the center frequencies of the learned convolutional filters range from 200 to 1000 Hz, unlike the MFCC filter bank frequencies, which span 0 to 4000 Hz. The task-specific features are found to perform better in SR estimation than the MFCCs. The articulatory features also help in accurate speech rate estimation, since the characteristics of articulatory motion vary significantly with changes in the SR. To estimate speech rate-specific articulatory representations, the AAI and CDNN models are jointly trained using a weighted loss function that combines a loss for speech rate estimation with a loss for estimating articulatory representations from acoustics.
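As a rough illustration only, one common form of such a weighted loss is a convex combination of the two objectives; the weight \alpha below is an assumed tuning parameter, not a value taken from the thesis:

    \mathcal{L}_{\mathrm{joint}} = \alpha \, \mathcal{L}_{\mathrm{SR}} + (1 - \alpha) \, \mathcal{L}_{\mathrm{AAI}}, \qquad 0 \le \alpha \le 1,

where \mathcal{L}_{\mathrm{SR}} penalises the error in the predicted speech rate (e.g., a squared error against the reference syllable rate) and \mathcal{L}_{\mathrm{AAI}} penalises the error in the articulatory trajectories estimated from acoustics.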
Similar to the acoustic case, the task-specific articulatory representations derived from acoustics perform better in speech rate estimation than the generic articulatory representations. Even though the task-specific articulatory representations derived from acoustics are not identical to the generic ones, both exhibit low-pass characteristics. The CDNN based approach, using both generic and learnt representations, performs better than the temporal and selected subband correlation (TCSSBC) based baseline scheme for the speech rate estimation task.
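The abstract describes a CDNN that regresses speech rate from acoustic (or articulatory) feature matrices. The following is a minimal sketch of such a network, assuming PyTorch and purely illustrative layer counts and sizes; it is not the architecture used in the thesis.

    # Hedged sketch: a small convolutional-dense network that maps an MFCC matrix
    # (coefficients x frames) to a single speech-rate value (syllables per second).
    # Layer counts and sizes are illustrative assumptions, not the thesis configuration.
    import torch
    import torch.nn as nn

    class SpeechRateCDNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size summary, independent of utterance length
            )
            self.dense = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 4 * 4, 64), nn.ReLU(),
                nn.Linear(64, 1),              # regressed speech rate
            )

        def forward(self, feats):              # feats: (batch, 1, n_coeffs, n_frames)
            return self.dense(self.conv(feats)).squeeze(-1)

    # Example: one utterance, 13 MFCCs x 300 frames -> one speech-rate estimate.
    model = SpeechRateCDNN()
    print(model(torch.randn(1, 1, 13, 300)).shape)  # torch.Size([1])

In the joint-training setting described in the abstract, the input would be the articulatory trajectories predicted by the AAI model rather than MFCCs, and the weighted loss sketched above would be back-propagated through both networks.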
dc.language.iso: en_US
dc.rights: I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
dc.subject: acoustics
dc.subject: articulatory data
dc.subject: task-specific representation learning
dc.subject.classification: Research Subject Categories::TECHNOLOGY::Electrical engineering, electronics and photonics::Other electrical engineering, electronics and photonics
dc.title: Speech task-specific representation learning using acoustic-articulatory data
dc.type: Thesis
dc.degree.name: MTech (Res)
dc.degree.level: Masters
dc.degree.grantor: Indian Institute of Science
dc.degree.discipline: Engineering

