Visual Speech Recognition
Abstract
Visual speech recognition (VSR), or automatic lip-reading, is the task of extracting speech
information from visual input. The addition of visual speech information has been shown to
improve the performance of traditional audio-only speech recognition (ASR) systems, and VSR
has hence been an active area of research since its inception. This thesis proposes a new VSR
system for isolated word recognition tasks, with a focus on the feature extraction methodology. A
novel two-stage feature extraction technique is proposed. Image-transform-based features,
namely the discrete cosine transform (DCT) and local binary patterns (LBP), are used. The use
of difference images for temporal feature extraction is also proposed.
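As a rough illustration of this two-stage idea, the sketch below computes low-frequency DCT
coefficients and a uniform-LBP histogram for each frame, then applies the same features to
successive difference images for the temporal stage. The libraries (OpenCV, scikit-image) and
all parameter values are illustrative assumptions, not the exact configuration used in this work.

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def frame_features(gray, size=(64, 64), lbp_points=8, lbp_radius=1):
    """Low-frequency DCT coefficients plus a uniform-LBP histogram for one frame."""
    gray = cv2.resize(gray, size)            # fixed, even size (cv2.dct needs even dims)
    dct = cv2.dct(np.float32(gray) / 255.0)
    dct_feat = dct[:8, :8].flatten()         # keep an 8x8 block of low frequencies
    lbp = local_binary_pattern(gray, lbp_points, lbp_radius, method="uniform")
    hist, _ = np.histogram(lbp, bins=lbp_points + 2,
                           range=(0, lbp_points + 2), density=True)
    return np.concatenate([dct_feat, hist])

def sequence_features(frames):
    """Static per-frame features plus features of successive difference images.

    Assumes the frame sequence has been resampled to a fixed length, so that
    the resulting vector has the same dimensionality for every video.
    """
    static = [frame_features(f) for f in frames]
    diffs = [cv2.absdiff(b, a) for a, b in zip(frames, frames[1:])]
    temporal = [frame_features(d) for d in diffs]
    return np.concatenate(static + temporal)
```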
A new region of interest (ROI), which consists of the throat and lower jaw along with the
mouth, is also introduced. For ROI extraction, the Viola-Jones algorithm is used.
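A hedged sketch of this step is given below, using OpenCV's implementation of the Viola-Jones
detector (CascadeClassifier) to locate the face and then cropping an extended region that covers
the mouth, lower jaw, and throat. The geometry fractions are assumptions for illustration, not
the exact ROI definition used here.

```python
import cv2

# Standard frontal-face cascade shipped with opencv-python.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_roi(frame):
    """Return a grayscale crop covering the mouth, lower jaw, and throat."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest face
    # Take the lower third of the face box and extend it ~40% below the chin
    # to include the jaw and throat (illustrative fractions).
    top = y + int(0.65 * h)
    bottom = min(gray.shape[0], y + int(1.4 * h))
    return gray[top:bottom, x:x + w]
```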
Classification is done using a multi-class Support Vector Machine (SVM) model. The system
provides a simple,
yet effective way to extract features from the video input, and performs comparably to
some recent VSR systems that employ more complex techniques, such as lip modelling
or deep learning, to extract visual features.
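For completeness, a minimal sketch of the classification stage follows, assuming scikit-learn's
SVC, which handles the multi-class case internally via a one-vs-one scheme. The kernel choice
and regularisation constant are illustrative defaults rather than tuned values from this work.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_word_classifier(X_train, y_train):
    """Fit a multi-class SVM on (n_samples, n_features) feature vectors."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(X_train, y_train)
    return clf

# Usage: words = train_word_classifier(X_train, y_train).predict(X_test)
```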