Neural Representation Learning for Speech and Audio Signals
Representation learning is the branch of machine learning concerned with techniques that automatically discover meaningful representations from raw data for efficient information extraction. In recent years, following trends in other areas of machine learning, representation learning using neural networks has attracted significant interest. For example, deep representation learning in the text domain using word embeddings has revealed semantic properties that make the embeddings widely useful for natural language processing applications. In the speech processing field, representation learning has remained a challenging task. This thesis focuses on developing neural methods for representation learning of speech and audio signals, with the goal of improving downstream applications that rely on these representations.
We pursue two broad directions for representation learning: supervised and unsupervised. For speech and audio signals, we identify and explore two stages of representation learning. The first stage is the learning of a time-frequency representation (the equivalent of a spectrogram) from the raw audio waveform. The second stage is the learning of modulation representations, obtained by filtering the time-frequency representation along the temporal dimension (rate filtering) and along the spectral dimension (scale filtering).
In the first part of the thesis, we propose methods for learning representations of speech data in an unsupervised manner. With modulation representation learning as the goal, we explore various neural architectures for unsupervised learning, including restricted Boltzmann machines (RBM), variational autoencoders (VAE) and generative adversarial networks (GAN). To learn modulation representations that are distinct and non-redundant, we propose different learning frameworks: an external residual approach, a skip-connection-based approach, and an approach based on a modified cost function.
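As a rough illustration of the two modulation filtering operations named above, the following sketch applies a 1-D filter along the time axis (rate filtering) and along the frequency axis (scale filtering) of a time-frequency representation. It is a minimal sketch under stated assumptions, not the method of the thesis: the function name `modulation_filter`, the random placeholder input, and the two fixed kernels are all hypothetical, whereas the thesis learns its filters from data.

```python
import numpy as np

def modulation_filter(spec, kernel, axis):
    """Filter a (bands, frames) time-frequency array with a 1-D kernel along one axis."""
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, spec
    )

# Placeholder input (assumption): 40 frequency bands by 100 frames of random values.
rng = np.random.default_rng(0)
spec = rng.standard_normal((40, 100))

# Hypothetical fixed kernels; a learned filter would take their place.
rate_kernel = np.array([-1.0, 0.0, 1.0]) / 2.0  # simple temporal difference
scale_kernel = np.ones(5) / 5.0                 # simple spectral smoothing

rate_filtered = modulation_filter(spec, rate_kernel, axis=1)    # along time
scale_filtered = modulation_filter(spec, scale_kernel, axis=0)  # along frequency
```

Both outputs keep the shape of the input, so rate and scale filters can be applied in sequence or in parallel to form a set of modulation feature maps.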
The methods developed for rate and scale representation learning are benchmarked on an automatic speech recognition (ASR) task in noisy and reverberant conditions. We also illustrate that unsupervised representation learning can be extended to the first stage, the learning of time-frequency representations from raw waveforms.
The second part of the thesis deals with supervised representation learning. Here, we propose a two-stage approach that operates on the raw waveform: acoustic filterbank learning (time-frequency representation learning) followed by modulation representation learning. Both stages are optimized directly for the task at hand. The key novelty of the proposed framework is a relevance weighting mechanism that acts as a feature selection module. Inspired by gating networks, it provides a means of weighting the relevance of the acoustic and modulation representations for the task involved. The relevance weighting network can also use feedback from the model's previous predictions for tasks like ASR. The proposed relevance weighting scheme is shown to provide significant performance improvements on an ASR task and on the UrbanSound audio classification task. A detailed analysis yields insights into the properties of the relevance weights captured by the model at the acoustic and modulation stages for speech and audio signals. In particular, the relevance weights are shown to succinctly capture phoneme characteristics in speech recognition and audio characteristics in urban sound classification.
In summary, the thesis makes strides in the direction of unsupervised and supervised neural representation learning of speech and audio signals. While conventional methods of speech and audio processing derive time-frequency spectrogram representations as the first step in most classification tasks, the work reported in the thesis argues that data-driven representations learned from the raw signal with minimal assumptions can yield task-specific flexibility and interpretability while also providing superior performance.
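The gating idea behind the relevance weighting mechanism described above can be sketched as follows: a small network maps each sub-band's feature trajectory to a scalar weight between 0 and 1, and the features are scaled by that weight before being passed on. This is only an illustrative sketch under stated assumptions. The function name `relevance_weight`, the layer sizes, the tanh and sigmoid nonlinearities, and the random parameters are hypothetical and are not taken from the thesis, where the weighting network is trained jointly with the task.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relevance_weight(features, w1, b1, w2, b2):
    """Gate a (bands, frames) feature array with one weight per band.

    Returns the weighted features and the per-band weights in (0, 1).
    """
    hidden = np.tanh(features @ w1 + b1)   # (bands, hidden_units)
    weights = sigmoid(hidden @ w2 + b2)    # (bands, 1)
    return features * weights, weights[:, 0]

# Placeholder input and parameters (assumptions): random values stand in
# for learned filterbank outputs and trained network weights.
rng = np.random.default_rng(0)
bands, frames, hidden_units = 40, 100, 16
features = rng.standard_normal((bands, frames))
w1 = 0.1 * rng.standard_normal((frames, hidden_units))
b1 = np.zeros(hidden_units)
w2 = 0.1 * rng.standard_normal((hidden_units, 1))
b2 = np.zeros(1)

weighted, weights = relevance_weight(features, w1, b1, w2, b2)
```

Because each weight is a single scalar per band, the learned values can be inspected directly, which is the kind of interpretability the analysis of the relevance weights relies on.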