dc.description.abstract | Representation learning is the branch of machine learning concerned with techniques that automatically discover meaningful representations from raw data for efficient information extraction. In recent years, following trends in other areas of machine learning, representation learning using neural networks has attracted significant interest. For example, deep representation learning in the text domain using word embeddings has revealed interesting semantic properties that make such embeddings widely useful for many natural language processing applications. In the speech processing field, representation learning has remained a challenging task. This thesis focuses on developing neural methods for representation learning of speech and audio signals, with the goal of improving downstream applications that rely on these representations.
For representation learning, we pursue two broad directions: supervised and unsupervised. For speech/audio signals, we identify and explore two stages of representation learning. The first stage is the learning of a time-frequency representation (the equivalent of a spectrogram) from the raw audio waveform. The second stage is the learning of modulation representations, obtained by filtering the time-frequency representations along the temporal axis (rate filtering) and along the spectral axis (scale filtering).
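The two-stage pipeline can be illustrated with a minimal numpy sketch. This is not the learned model from the thesis: the first stage is replaced by a fixed magnitude STFT, and the modulation stage by a hand-picked difference kernel; the function names and kernel are purely illustrative.

```python
import numpy as np

def spectrogram(x, win=256, hop=128):
    """Stage 1: time-frequency representation from the raw waveform.
    Here a plain magnitude STFT; the thesis learns this stage instead."""
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1)).T  # (freq, time)

def modulation_filter(spec, kernel, axis):
    """Stage 2: 1-D filtering of the time-frequency representation.
    axis=1 (time) -> rate filtering; axis=0 (frequency) -> scale filtering."""
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, spec)

x = np.random.randn(4000)                   # toy waveform
spec = spectrogram(x)                       # stage 1
diff_kernel = np.array([1.0, 0.0, -1.0])    # crude modulation (difference) filter
rate_repr = modulation_filter(spec, diff_kernel, axis=1)   # rate filtering
scale_repr = modulation_filter(spec, diff_kernel, axis=0)  # scale filtering
```

Both modulation representations keep the spectrogram's shape; in the thesis the filters themselves are learned rather than fixed.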
In the first part of the thesis, we propose representation learning methods for speech data in an unsupervised manner. With modulation representation learning as the goal, we explore various neural architectures for unsupervised learning, including restricted Boltzmann machines (RBM), variational autoencoders (VAE), and generative adversarial networks (GAN). For learning modulation representations that are distinct and non-redundant, we propose different learning frameworks, such as an external residual approach, a skip-connection-based approach, and a modified-cost-function-based approach. The methods developed for rate and scale representation learning are benchmarked on an automatic speech recognition (ASR) task in noisy and reverberant conditions. We also illustrate that unsupervised representation learning can be extended to the first stage of learning time-frequency representations from raw waveforms.
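The external residual idea can be sketched in a heavily simplified linear form (truncated SVD standing in for an autoencoder; this is an illustration of the principle, not the thesis architecture): a first representation reconstructs part of the input, and a second representation is then trained only on what remains, discouraging the two streams from encoding redundant information.

```python
import numpy as np

def fit_linear_autoencoder(X, k):
    """Rank-k linear 'autoencoder' via truncated SVD (illustrative stand-in
    for a learned encoder): returns the code and the reconstruction of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    code = U[:, :k] * s[:k]      # learned representation
    recon = code @ Vt[:k]        # reconstruction of X from the code
    return code, recon

rng = np.random.default_rng(1)
X = rng.standard_normal((64, 32))             # toy batch of features
code1, recon1 = fit_linear_autoencoder(X, k=4)
residual = X - recon1                         # external residual
# Second stream trained on the residual only -> non-redundant with stream 1.
code2, recon2 = fit_linear_autoencoder(residual, k=4)
```

By construction the second code explains variance the first one missed, so the combined reconstruction is strictly better than either stream alone.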
The second part of the thesis deals with supervised representation learning. Here, we propose a two-stage representation learning approach consisting of acoustic filterbank learning (time-frequency representation learning) from the raw waveform, followed by modulation representation learning. This two-stage learning is directly optimized for the task at hand. The key novelty of the proposed framework is a relevance weighting mechanism that acts as a feature selection module. Inspired by gating networks, it provides a mechanism to weight the relevance of the acoustic and modulation representations for the task involved. The relevance weighting network can also utilize feedback from the previous predictions of the model for tasks like ASR. The proposed relevance weighting scheme is shown to provide significant performance improvements on the ASR task and the UrbanSound audio classification task. A detailed analysis yields insights into the properties of the relevance weights captured by the model at the acoustic and modulation stages for speech and audio signals. In particular, the relevance weights are shown to succinctly capture phoneme characteristics in the speech recognition task and audio characteristics in the urban sound classification task.
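The gating-style relevance weighting can be sketched as follows. This is a minimal illustration of the idea (a small network scores each sub-band and soft-masks the features), with assumed shapes, parameter names, and a mean-energy summary that are not taken from the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relevance_weighting(features, W, b):
    """Gating-style feature selection: score each sub-band of the
    representation and multiply the features by sigmoid relevance weights.
    features: (bands, time); W: (bands, bands); b: (bands,)."""
    summary = features.mean(axis=1)        # per-band summary statistic
    weights = sigmoid(W @ summary + b)     # relevance weights in (0, 1)
    return features * weights[:, None], weights

rng = np.random.default_rng(0)
feats = rng.standard_normal((40, 100))     # e.g. 40 learned acoustic sub-bands
W = rng.standard_normal((40, 40)) * 0.1    # toy gating-network parameters
b = np.zeros(40)
gated, w = relevance_weighting(feats, W, b)
```

In the trained system these weights would be learned end-to-end with the task objective, so that informative sub-bands receive weights near one and uninformative ones are suppressed.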
In summary, the thesis makes strides in unsupervised and supervised neural representation learning of speech and audio signals. While conventional speech/audio processing derives time-frequency spectrogram representations as the first step in most classification tasks, the work reported in the thesis argues that data-driven representations learned from the raw signal with minimal assumptions can yield task-specific flexibility and interpretability while also providing superior performance. | en_US |