Improved air-tissue boundary segmentation in real-time magnetic resonance imaging videos using speech articulator specific error criterion
Abstract
Real-time Magnetic Resonance Imaging (rtMRI) is a tool used extensively in speech science and linguistics to understand the dynamics of the speech production process across languages and health conditions. rtMRI has two advantages over other methods that capture articulatory movement, such as X-ray, ultrasound, and electromagnetic articulography: it is non-invasive, and it captures a complete view of the vocal tract, including pharyngeal structures. An rtMRI video provides spatio-temporal information about speech articulatory movements, which helps in modeling speech production. For this purpose, a common step is to obtain the air-tissue boundary (ATB) segmentation in all frames of the rtMRI video. Accurate estimation of the ATBs of the upper airway of the vocal tract is essential for many speech processing applications, such as speaker verification, text-to-speech synthesis, visual augmentation of synthesized articulatory videos, and analysis of vocal tract movement. Thus, it is necessary to have an accurate ATB segmentation in every frame of the rtMRI video.
The best performance in ATB segmentation of rtMRI videos of speech production, in unseen subject conditions, is known to be achieved by a 3-dimensional convolutional neural network (3D-CNN) model. In seen subject conditions, both the 3D-CNN and a 2-dimensional deep convolutional encoder-decoder network (SegNet) show similar performance. However, the evaluation of these models, as well as of other ATB segmentation techniques reported in the literature, has been done using the Dynamic Time Warping (DTW) distance between the entire original and predicted boundaries, or contours. Such an evaluation measure may not capture local errors in the predicted contour. Careful analysis of the predicted contours reveals errors in regions such as the velum and the tongue base, which are not captured by a global evaluation metric like the DTW distance. In this thesis, such errors are detected automatically, and a novel correction scheme is proposed for them. Two new evaluation metrics are also proposed for ATB segmentation, separately for each contour, to explicitly capture errors in these contours.
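For reference, a minimal sketch of the kind of global DTW-based evaluation described above is given below, assuming a Euclidean local distance between contour points and a simple length normalisation (both are illustrative choices, not necessarily those used in this thesis or in prior work):

import numpy as np

def dtw_distance(contour_gt, contour_pred):
    """Global DTW distance between a ground-truth contour and a predicted
    contour, each given as an (N, 2) array of (x, y) boundary points."""
    n, m = len(contour_gt), len(contour_pred)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two boundary points (assumed)
            d = np.linalg.norm(contour_gt[i - 1] - contour_pred[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],       # skip a ground-truth point
                                 cost[i, j - 1],       # skip a predicted point
                                 cost[i - 1, j - 1])   # match the two points
    # Length-normalised total alignment cost over the whole contour
    return cost[n, m] / (n + m)

Because the alignment cost is accumulated and normalised over the entire contour, a large deviation confined to a short segment (for example, around the velum or the tongue base) contributes relatively little to the final score, which is what the proposed contour-specific metrics aim to address.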
Moreover, the state-of-the-art models use the overall binary cross entropy as the loss function during model training. However, such a global loss function does not place enough emphasis on regions that are more prone to errors. In this thesis, the use of regional loss functions, together with the global loss, is explored; these focus on the areas of the contours identified as error-prone in the analysis. Two different losses are considered in the regions around the velum and the tongue base: binary cross entropy (BCE) loss and Dice loss. It is observed that the Dice-loss-based models perform better than their BCE-loss-based counterparts.
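As an illustration, a combined objective of this kind can be sketched as follows, where R denotes a hypothetical binary mask over the velum or tongue base region, \lambda is an assumed weighting factor, and \epsilon is a small smoothing constant; the exact formulation used in the thesis may differ:

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{BCE}}^{\text{global}} + \lambda \, \mathcal{L}_{\text{region}},
\qquad
\mathcal{L}_{\text{BCE}}^{\text{global}} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \bigr],

with the regional term taken either as the BCE restricted to the pixels in R or as the Dice loss over R:

\mathcal{L}_{\text{Dice}}^{R} = 1 - \frac{2 \sum_{i \in R} p_i \, y_i + \epsilon}{\sum_{i \in R} p_i + \sum_{i \in R} y_i + \epsilon},

where p_i is the predicted tissue probability and y_i the ground-truth label at pixel i.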