Characterization and Enhancement of Dysarthric Speech for Amyotrophic Lateral Sclerosis: A Source-Filter Perspective

Bhattacharjee, Tanuka

Abstract

Dysarthria is a motor speech disorder that affects several aspects of speech function, such as respiration, phonation, articulation, prosody, and resonance. It progressively compromises the intelligibility and naturalness of speech, making vocal communication extremely difficult for the affected individuals. Different types of neurological damage can lead to different varieties of dysarthria. In this thesis, we focus on dysarthria caused by the incurable neurodegenerative disease amyotrophic lateral sclerosis (ALS). ALS is the most common motor neuron disease among adults, with a worldwide incidence of 1.59 per 100,000 person-years. About 30% of ALS patients experience dysarthria as an early sign of the disease, while almost all patients develop it at some stage during disease progression. This thesis explores dysarthric speech specific to ALS from a source-filter perspective, with the aim of characterizing and enhancing this type of speech. Source-filter modelling captures the physiological processes of speech production and hence facilitates a detailed analysis of how ALS affects these processes. This direction of research can therefore enhance our understanding of the characteristics and behavior of the disease, help in targeted automatic assessment and monitoring, and lead to the development of physiologically motivated intelligent assistive systems, such as speech enhancement systems, for the patients. Though the impairments in different speech sub-systems and the resulting speech abnormalities due to ALS-related dysarthria have been studied in the literature, few efforts have been made to analyze this type of dysarthric speech by exploiting the source-filter model. Dysarthria due to ALS is expected to affect both the source and the filter components of speech utterances.
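For readers unfamiliar with the source-filter view, the following is a minimal illustrative sketch, not the analysis pipeline used in the thesis. It assumes an idealized impulse-train source and a single two-pole resonator standing in for the vocal tract filter; the function names and all parameter values (100 Hz pitch, 700 Hz resonance, 8 kHz sampling) are hypothetical choices made only for illustration.

```python
import math

def impulse_train(f0_hz, sample_rate, n_samples):
    """Idealized voiced source: one unit impulse per pitch period (assumption)."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, formant_hz, bandwidth_hz, sample_rate):
    """Two-pole all-pole filter standing in for one vocal tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius
    theta = 2.0 * math.pi * formant_hz / sample_rate      # pole angle
    a1 = -2.0 * r * math.cos(theta)
    a2 = r * r
    out = []
    y1 = y2 = 0.0  # previous two output samples
    for x in signal:
        y = x - a1 * y1 - a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

# Source (excitation) and filter (resonance) are modelled separately,
# then combined by passing the source through the filter.
source = impulse_train(f0_hz=100.0, sample_rate=8000, n_samples=800)
speech = resonator(source, formant_hz=700.0, bandwidth_hz=100.0, sample_rate=8000)
```

In this toy setting, changing `f0_hz` alters only the source, while changing `formant_hz` or `bandwidth_hz` alters only the filter, which is the separation that a source-filter analysis tries to recover from recorded speech.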
We first assess the relative contributions of different source- and filter-related attributes toward automatic classification between individuals with ALS and healthy controls (HC) using different speech tasks, namely, sustained vowels, sustained fricatives, a diadochokinetic task, image descriptions, and spontaneous speech. The attributes achieving better classification performance are expected to capture those aspects of the source and filter that are more critically affected in the disorder, and hence exhibit better discriminative properties. We explore decomposition-based source and filter cues obtained using an Iterative Adaptive Inverse Filtering (IAIF) based approach and a WORLD vocoder-based approach. We also consider some standard and commonly used source-related features, like harmonic-to-noise ratio (HNR), which are not necessarily estimated using inverse filtering or vocoders. Within the IAIF- and WORLD-based constructs, filter cues are found to exhibit better discriminative abilities than source cues for all speech tasks except the sustained fricatives and three sustained vowels, namely /a/, /o/, and /u/. In those exceptional cases, source and filter cues perform similarly. It is further observed that for sustained utterances of the close front vowel /i/, static cues of the source and filter (capturing the average source and filter configurations achieved during an utterance) provide better discriminative information than their dynamic counterparts (capturing the temporal variations in the source and filter configurations during an utterance), whereas the reverse is true for sustained utterances of /a/, /o/, and /u/. For certain speech tasks, like the diadochokinetic task, some standard source cues, like HNR, seem to capture the impairment more precisely, thereby achieving better classification performance than the IAIF- and WORLD-based source and filter cues.
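The distinction between static and dynamic cues can be made concrete with a small sketch. This is an assumption-laden illustration, not the thesis's feature definition: here a static cue is taken to be the utterance-level mean of a frame-wise feature track, and a dynamic cue the mean absolute frame-to-frame change. The function names and the example tracks are hypothetical.

```python
from statistics import mean

def static_cue(track):
    """Average configuration over the utterance (assumed definition)."""
    return mean(track)

def dynamic_cue(track):
    """Mean absolute frame-to-frame change (assumed definition)."""
    deltas = [abs(b - a) for a, b in zip(track, track[1:])]
    return mean(deltas)

# Hypothetical frame-wise tracks of one filter attribute (e.g. a resonance in Hz).
steady = [500.0, 500.0, 500.0, 500.0, 500.0]
wobbly = [480.0, 520.0, 480.0, 520.0, 480.0]

steady_summary = (static_cue(steady), dynamic_cue(steady))
wobbly_summary = (static_cue(wobbly), dynamic_cue(wobbly))
```

The two tracks have nearly the same static cue but very different dynamic cues, which shows why the two kinds of summary can carry different discriminative information for different vowels.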
It is also observed that overall cues capturing information from both source and filter components together, like mel frequency cepstral coefficients (MFCC), do not provide performance benefits in ALS vs HC classification over the individual source and filter cues in most cases. We also analyze the robustness of different source, filter, and overall cues against different variants of noise for ALS vs HC classification, both with and without the constraint of low-complexity classifiers. This is essential for real-world diagnostic applications of these cues. Next, we analyze the discriminative abilities of the source- and filter-related cues of speech for automatic dysarthria severity classification for ALS. In this case, the filter-related cues are observed to provide better discriminative information than source cues (the IAIF- and WORLD-based ones as well as the standard ones) for all speech tasks except the sustained fricatives, where WORLD-based source cues outperform the other features. Thus, severity-wise changes in the filter component seem to be more pronounced than those in the source component for most of the speech tasks. Moreover, the overall cues are not found to provide performance benefits in dysarthria severity classification over the best-performing individual source and filter cues in most cases. The primary challenge in developing dysarthria severity classification systems lies in the limited availability of data, particularly for more severe conditions. We propose different transfer learning approaches that leverage the source- and filter-related information by means of auxiliary tasks to mitigate the data scarcity issue. The transfer learning approaches are found to improve the dysarthria severity classification accuracy, especially for the mild dysarthria class. Another aspect of this thesis is to analyze the speaker-wise variations in different source and filter attributes of dysarthric speech for ALS.
The timing and degree of involvement of different speech sub-systems in ALS-related dysarthria may vary across individuals. This, in turn, may lead to larger inter-speaker differences in particular source and filter characteristics than those existing among the HCs. We study the source-level and filter-level inter-speaker differences during sustained vowel utterances at varied severities of ALS-related dysarthria. Among source attributes, jitter and standard deviation of fundamental frequency are found to exhibit greater inter-speaker differences among patients than among HCs at all severity levels. Though inter-speaker differences in filter attributes at most severity levels are observed to be higher than those among HCs for the close vowels /i/ and /u/, these are comparable with or lower than those among HCs for the relatively more open vowels /a/ and /o/. The inter-speaker differences typically increase with severity. Lastly, we study the relative utility of enhancing the source and the filter components of dysarthric speech toward improving the naturalness and intelligibility of the utterances while minimally affecting the speaker identity of the dysarthric subjects. We limit the study to the enhancement of sustained vowel utterances only. We observe that enhancing the filter component is more important than enhancing the source for improving the recognizability and naturalness of dysarthric sustained vowels, though enhancements of the two components are often found to complement each other. However, the speaker identity of the dysarthric subject is found to be more affected when the filter component is enhanced than when the source is enhanced. Moreover, the proposed enhancement approaches are found to shift the distributions of the critically impaired source and filter attributes, as identified during ALS vs HC classification, closer to their respective healthy spaces.
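As an illustration of the two source attributes named above, the sketch below computes a commonly used form of local jitter (mean absolute difference between consecutive glottal periods, divided by the mean period) and the standard deviation of fundamental frequency from a sequence of pitch periods. Whether the thesis uses exactly these definitions is an assumption; the function names and the example period sequences are hypothetical.

```python
from statistics import mean, pstdev

def local_jitter(periods_s):
    """Mean absolute consecutive-period difference over mean period (assumed definition)."""
    diffs = [abs(b - a) for a, b in zip(periods_s, periods_s[1:])]
    return mean(diffs) / mean(periods_s)

def f0_std(periods_s):
    """Population standard deviation of per-cycle fundamental frequency in Hz."""
    return pstdev([1.0 / p for p in periods_s])

# Hypothetical glottal period sequences in seconds.
stable_periods = [0.010, 0.010, 0.010, 0.010, 0.010]
perturbed_periods = [0.010, 0.011, 0.010, 0.011]

stable_jitter = local_jitter(stable_periods)
perturbed_jitter = local_jitter(perturbed_periods)
perturbed_f0_std = f0_std(perturbed_periods)
```

Both measures are zero for perfectly periodic phonation and grow as cycle-to-cycle irregularity increases, so comparing their spread across speakers gives one simple view of inter-speaker differences at the source level.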

URI

https://etd.iisc.ac.in/handle/2005/7566

Collections

Electrical Engineering (EE)