Spectrotemporal Processing of Speech Signals Using the Riesz Transform
Abstract
Speech signals possess rich time-varying spectral content, which makes their analysis a challenging signal processing problem. Developing methods for accurate speech analysis has a direct impact on applications such as speech synthesis, speaker recognition, speech recognition, and voice morphing. A widely used tool to visualize the time-varying spectral content is the spectrogram, which represents the spectral content of the signal in the joint time-frequency plane. A spectrogram can be viewed as a collection of localized spectrotemporal patches. Based on the structure of the two-dimensional (2-D) patterns in these patches, we propose to model the spectrogram using 2-D amplitude-modulated and frequency-modulated (AM-FM) sinusoids. The 2-D AM-FM model is justified by the physical process underlying speech production: the AM and FM components correspond to the smooth vocal-tract envelope and the excitation signal, respectively. We demonstrate that analyzing speech jointly in time and frequency reveals several important characteristics that are not evident in purely time-domain or frequency-domain analysis.
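To make the model concrete, the following toy Python sketch constructs a synthetic spectrotemporal patch of the form s(t, f) = a(t, f) cos(Φ(t, f)): a smooth spectral envelope with a slow temporal modulation stands in for the vocal-tract AM, and a cosine whose frequency-axis period equals an assumed pitch f0 stands in for the FM carrier. The grid sizes and parameter values are illustrative only, not taken from the thesis.

```python
# Illustrative sketch (not from the thesis): a synthetic spectrotemporal
# patch modeled as a 2-D AM-FM cosine, s(t, f) = a(t, f) * cos(phi(t, f)).
import numpy as np

t = np.linspace(0, 1, 128)           # time axis (s), hypothetical grid
f = np.linspace(0, 4000, 128)        # frequency axis (Hz)
T, F = np.meshgrid(t, f, indexing="ij")

# Slowly varying AM: a smooth spectral envelope (stand-in for the vocal
# tract) with a slow temporal modulation.
am = np.exp(-((F - 1000.0) / 800.0) ** 2) * (0.5 + 0.5 * np.cos(2 * np.pi * 3 * T))

# FM carrier: phase advances by 2*pi every f0 Hz along the frequency axis,
# mimicking the harmonic spacing set by the pitch.
f0 = 120.0                           # assumed fundamental frequency (Hz)
phase = 2 * np.pi * F / f0
patch = am * np.cos(phase)           # the 2-D AM-FM patch
```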
The central problem in this dissertation is 2-D demodulation of a speech spectrogram, which yields the 2-D AM and FM components. We advocate the use of the Riesz transform, a 2-D extension of the Hilbert transform, to demodulate narrowband and pitch-adaptive spectrograms. The 2-D AM and FM components obtained through demodulation are directly useful for speech analysis. We demonstrate the impact of the proposed modeling technique on vocal-tract filter estimation, voiced/unvoiced component separation, pitch tracking, speech synthesis, and periodic/aperiodic decomposition of speech signals. The accuracy of the estimated speech parameters is validated on the task of speech reconstruction.
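A minimal sketch of the demodulation step, assuming the standard monogenic-signal formulation of the Riesz transform (frequency responses -j u/|ω| and -j v/|ω|): the two Riesz components yield the local amplitude (AM), local phase, and local orientation of a 2-D patch. The function name is hypothetical, and in practice a bandpass (e.g., log-Gabor) prefilter is usually applied first; that step is omitted here.

```python
# Sketch of Riesz-transform demodulation via the monogenic signal; `patch`
# is any real 2-D array (e.g., a bandpass-filtered spectrogram region).
import numpy as np

def riesz_demodulate(patch):
    rows, cols = patch.shape
    u = np.fft.fftfreq(rows)[:, None]       # vertical frequency grid
    v = np.fft.fftfreq(cols)[None, :]       # horizontal frequency grid
    norm = np.sqrt(u**2 + v**2)
    norm[0, 0] = 1.0                        # avoid division by zero at DC
    spec = np.fft.fft2(patch)
    r1 = np.real(np.fft.ifft2(-1j * u / norm * spec))  # first Riesz component
    r2 = np.real(np.fft.ifft2(-1j * v / norm * spec))  # second Riesz component
    am = np.sqrt(patch**2 + r1**2 + r2**2)             # local amplitude (AM)
    phase = np.arctan2(np.hypot(r1, r2), patch)        # local phase (FM source)
    orientation = np.arctan2(r2, r1)                   # local orientation
    return am, phase, orientation
```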
The first part of the thesis focuses on theoretical developments related to 2-D modeling. We consider prototypical 2-D cosine signals, analyze their Fourier transform properties, solve the problem of demodulating a 2-D AM-FM cosine signal, and extend the model to spectrotemporal patches. Following this, we examine the taxonomy of time-frequency patterns in the FM component, highlighting the salient attributes of different types of phonation in speech. We show that 2-D patterns specific to different speech sounds (voiced/unvoiced) can be captured by computing two novel time-frequency maps from the 2-D FM component: the coherencegram and the orientationgram. The usefulness of these maps is demonstrated on the problem of periodic/aperiodic decomposition of speech signals.
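The exact definitions of the coherencegram and orientationgram are developed in the thesis; the sketch below shows one plausible construction, with the orientation map taken from the Riesz components and coherence measured as the local agreement of doubled-angle orientation vectors (the doubling makes θ and θ + π equivalent). The function name and the averaging window size are assumptions.

```python
# Hypothetical sketch of orientation/coherence maps from the Riesz
# components r1, r2 of the FM pattern; the thesis's definitions may differ.
import numpy as np
from scipy.ndimage import uniform_filter

def orientation_coherence(r1, r2, size=9):
    theta = np.arctan2(r2, r1)              # local orientation (orientationgram)
    # Average doubled-angle unit vectors so theta and theta + pi agree.
    c = uniform_filter(np.cos(2 * theta), size)
    s = uniform_filter(np.sin(2 * theta), size)
    coherence = np.hypot(c, s)              # 1 = locally consistent, 0 = random
    return theta, coherence
```

Voiced regions, with their near-horizontal harmonic striations, would then show high coherence and a consistent orientation, while unvoiced regions would show low coherence.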
In the second part, we use the FM component to estimate the source parameters. We show that the FM component is a rich 2-D representation of the source signal and use it to estimate the speaker's fundamental frequency (or pitch), speech aperiodicity, and the voiced/unvoiced segmentation of the speech signal. We propose novel spectrotemporal features for voiced/unvoiced segmentation of speech. In contrast to time-domain features such as short-time energy, zero crossings, and autocorrelation coefficients, the proposed features are relatively insensitive to local variations of the speech waveform. The FM component is obtained by demodulating the narrowband speech spectrogram, which has high frequency resolution; consequently, the FM component encodes the speaker's pitch, and we propose methods for estimating the pitch from it. Another critical component of a speech signal is its aperiodicity. Voiced sounds are quasi-periodic, with a noise component that is relatively weaker than in unvoiced sounds. Utilizing the time-frequency properties of the FM component, we propose methods for estimating speech aperiodicity.
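As a toy illustration of pitch estimation from the FM component, the sketch below exploits the fact that the harmonics of a narrowband spectrogram, and hence the FM pattern, repeat every f0 along the frequency axis: the autocorrelation of one FM column peaks at a lag corresponding to the harmonic spacing. The function name, search range, and column-wise strategy are assumptions, not the thesis's algorithm.

```python
# Toy pitch estimator: find the frequency-axis period of one FM column
# (e.g., cos(phase) at a fixed time frame of a narrowband spectrogram).
import numpy as np

def pitch_from_fm_column(fm_col, hz_per_bin, fmin=60.0, fmax=400.0):
    x = fm_col - fm_col.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # autocorrelation, lags >= 0
    lo = max(1, int(fmin / hz_per_bin))                # lag of smallest f0
    hi = int(fmax / hz_per_bin)                        # lag of largest f0
    lag = lo + np.argmax(ac[lo:hi + 1])                # strongest harmonic spacing
    return lag * hz_per_bin                            # f0 estimate in Hz
```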
While the FM component is used to estimate the source parameters, the 2-D AM component models the slowly varying vocal-tract filter. However, estimating the vocal-tract filter is challenging due to its interaction with the quasi-periodic excitation. Two issues arise in this context. The first concerns the length of the analysis window used to compute the spectrogram. We argue that a fixed-length analysis window is not ideal for vocal-tract estimation and show that the best results are obtained by adapting the window length to the speaker's pitch; the resulting representation is referred to as the pitch-adaptive spectrogram. The second issue concerns the processing involved in demodulation, which has the undesirable side effect of broadening the formant bandwidths. Hence, we propose a method to compensate for the formant broadening. Accurate formant bandwidths are crucial because they determine the shape of the vocal-tract filter and govern speech intelligibility during synthesis.
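A sketch of one way to compute a pitch-adaptive spectrogram, assuming the window is set to span a fixed number of pitch periods given a precomputed pitch track; the function name and the adaptation rule (n_periods, the 60 Hz floor) are illustrative, not the thesis's exact procedure.

```python
# Sketch of a pitch-adaptive spectrogram: the analysis-window length tracks
# the local pitch so each window spans roughly `n_periods` pitch periods.
import numpy as np

def pitch_adaptive_spectrogram(x, fs, f0_track, hop, n_fft=1024, n_periods=3):
    frames = []
    for i, f0 in enumerate(f0_track):                  # one f0 value per hop
        center = i * hop
        win_len = int(n_periods * fs / max(f0, 60.0))  # ~n_periods pitch periods
        start = max(center - win_len // 2, 0)
        seg = x[start:start + win_len]
        seg = seg * np.hanning(len(seg))               # taper the segment
        frames.append(np.abs(np.fft.rfft(seg, n_fft))) # zero-padded magnitude
    return np.array(frames).T                          # frequency x time
```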
The effectiveness of the estimated source and filter parameters is demonstrated by incorporating them in a spectral synthesis model and a neural vocoder for speech reconstruction. For the neural vocoder, we use WaveNet, a deep generative model for audio. By conditioning the model on acoustic features, one can guide WaveNet to produce realistic speech waveforms; we use the Riesz transform-based acoustic features as conditioning features in the WaveNet vocoder. The quality of the generated speech waveforms is evaluated using objective and subjective measures.
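For context, the sketch below shows the gated activation with local conditioning used in WaveNet-style vocoders, z = tanh(W_f x + V_f c) ⊙ σ(W_g x + V_g c), where c denotes the acoustic features upsampled to the sample rate. The function name and matrix shapes are illustrative; this is not the thesis's implementation.

```python
# Toy illustration of local conditioning in a WaveNet-style gated unit:
# conditioning features c are projected and added inside both gates.
import numpy as np

def gated_unit(x, c, Wf, Wg, Vf, Vg):
    # x: dilated-conv output (channels x time); c: conditioning features,
    # already upsampled to the same time resolution as x.
    z_f = np.tanh(Wf @ x + Vf @ c)                   # filter gate
    z_g = 1.0 / (1.0 + np.exp(-(Wg @ x + Vg @ c)))   # sigmoid gate
    return z_f * z_g                                 # gated activation
```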