Speech-Based Low-Complexity Classification of Patients with Amyotrophic Lateral Sclerosis from Healthy Controls: Exploring the Role of Hypernasality
Abstract
Amyotrophic Lateral Sclerosis (ALS) is a neurodegenerative disorder characterized
by motor neuron degeneration, leading to muscle weakness, atrophy,
and speech impairments. Dysarthria, an early symptom in approximately
30% of ALS patients, often presents with hypernasality due to velopharyngeal dysfunction, which is observed in around 73.88% of individuals with bulbar-onset ALS. These speech impairments significantly impact communication
and quality of life. Current ALS monitoring methods, such as clinical
assessments, genetic testing, electromyography (EMG), and magnetic
resonance imaging (MRI), are time-consuming and invasive. In contrast,
speech-based approaches provide a non-invasive and efficient alternative.
However, the lack of large ALS-specific speech datasets hinders model development.
This study aims to develop a simplified, low-complexity model to distinguish ALS speech from healthy control (HC) speech, using hypernasality as a key indicator and thereby avoiding the need for large ALS-specific datasets.
The study begins by investigating hypernasality in ALS speech across
varying dysarthria severity, using HuBERT (Hidden Unit BERT) representations
and Mel-frequency cepstral coefficients (MFCC) features. Next,
the research focuses on simplifying deep learning models trained in the traditional way on ALS and HC data, transitioning from complex Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) models to simpler Deep Neural Networks (DNNs) and Support Vector Machines (SVMs). These models are trained on HuBERT representations and on MFCCs with their derivatives (deltas and double-deltas) as features, with various temporal statistics explored.
The individual components and coefficients of the MFCC and its derivatives are also analyzed separately to reduce feature dimensionality and computational cost. The study also integrates hypernasality into the ALS vs. HC
classification by training a model for nasal vs. non-nasal phoneme classification
using healthy speech data. This model then labels ALS speech as the nasal class and HC speech as the non-nasal class, proving effective in distinguishing ALS speech from HC speech. Finally, the study analyzes classification
accuracies with and without using nasality, considering varying sizes
of ALS dataset. It explores the potential of nasality to provide reliable
classification results, particularly in cases where ALS data is limited.
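The delta and double-delta features mentioned above are standard regression-based time derivatives of the cepstral trajectory. A minimal sketch of that computation is below; the window half-width N=2, the 13-coefficient dimensionality, the random stand-in MFCC matrix, and the mean as the temporal statistic are illustrative assumptions, not the study's actual configuration:

```python
import numpy as np

def deltas(c, N=2):
    """Regression-based time derivative of a cepstral matrix.

    c: (n_coeffs, n_frames) array; returns an array of the same shape.
    Standard formula: d_t = sum_{n=1..N} n*(c_{t+n} - c_{t-n}) / (2*sum n^2).
    """
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(c, ((0, 0), (N, N)), mode="edge")  # repeat edge frames
    T = c.shape[1]
    d = np.zeros_like(c, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[:, N + n : N + n + T] - padded[:, N - n : N - n + T])
    return d / denom

# Illustrative only: random stand-in for a 13-coefficient MFCC matrix.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(13, 100))
delta = deltas(mfcc)          # first derivative (deltas)
double_delta = deltas(delta)  # second derivative (double-deltas)

# A temporal statistic (here the mean over frames) collapses each
# (coeffs x frames) matrix into a fixed-length utterance-level vector.
feature = np.concatenate([m.mean(axis=1) for m in (mfcc, delta, double_delta)])
print(feature.shape)  # -> (39,): static + delta + double-delta
```

Collapsing the variable-length frame sequence into a fixed-length statistic vector is what lets the simpler DNN and SVM models operate on whole utterances without recurrent layers.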
The results show that nasality increases with disease severity, as observed
through both experimental results and perceptual analysis. Using
the traditional method with HuBERT representations as features, the CNN-BiLSTM model achieves an average accuracy of 85.18% for the Spontaneous Speech (SPON) task and 85.21% for the Diadochokinetic Rate (DIDK) task. The SVM model's accuracy is lower by 7.87% for SPON and 6.50% for DIDK, but it requires significantly fewer resources: only 769 parameters and 1,536 floating-point operations (FLOPs), compared with the CNN-BiLSTM's 1,761,032 parameters and 2,840,000 FLOPs. MFCC features achieve accuracy comparable to HuBERT's, with an average of 77.24% for SPON and 77.21% for DIDK with the SVM, using only 37 parameters and 72 FLOPs against HuBERT's 769 parameters and 1,536 FLOPs. Dimensionality reduction
of the MFCC further minimizes complexity: individual delta and double-delta coefficients give the highest SVM accuracy, 78.24% for SPON and 78.16% for DIDK, using only 2 parameters and 2 FLOPs. When nasality is used as the indicator, the CNN-BiLSTM model achieves a maximum accuracy of 68.63% for the SPON task, while the SVM model achieves 70.15% with much lower complexity (769 parameters and 1,536 FLOPs versus 1,761,032 parameters and 2,840,000 FLOPs for the CNN-BiLSTM). Similarly, for the DIDK task, the
CNN-BiLSTM reaches 80.74% accuracy, while DNN models and SVM provide
comparable accuracy with significantly reduced computational cost.
The nasality-based method maintains relatively stable accuracy across different
dataset sizes, outperforming the traditional method by 2-6% for
SPON and 2-10% for DIDK when using only 10% of the dataset, and
achieving up to a 3% improvement for DIDK with 40% of the data.
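The SVM parameter and FLOP counts quoted above follow directly from the feature dimensionality: a linear decision function w·x + b over a 768-dimensional HuBERT embedding has 768 weights plus one bias (769 parameters) and costs roughly two FLOPs per weight (1,536 FLOPs). A small sketch of that accounting; the counting convention (one multiply plus one add per weight) and the 36-dimensional MFCC statistic vector implied by the 37-parameter figure are assumptions:

```python
def linear_svm_cost(dim):
    """Parameter and FLOP count for a linear SVM decision function w.x + b."""
    params = dim + 1   # dim weights + 1 bias term
    flops = 2 * dim    # dim multiplies + (dim - 1) adds + 1 bias add
    return params, flops

print(linear_svm_cost(768))  # 768-dim HuBERT embedding -> (769, 1536)
print(linear_svm_cost(36))   # assumed 36-dim MFCC vector -> (37, 72)
```

The same accounting explains the 2-parameter, 2-FLOP figure for a single delta or double-delta coefficient: one weight, one bias, one multiply, one add.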

