dc.contributor.advisor: Ghosh, Prasanta Kumar
dc.contributor.author: Yarra, Chiranjeevi
dc.date.accessioned: 2020-11-27T07:40:03Z
dc.date.available: 2020-11-27T07:40:03Z
dc.date.submitted: 2019
dc.identifier.uri: https://etd.iisc.ac.in/handle/2005/4697
dc.description.abstract: The quality of spoken English pronunciation is often influenced by the nativity of a learner for whom English is a second language. Typically, a learner's pronunciation quality depends on four sub-qualities: 1) phonemic quality, 2) syllable stress quality, 3) intonation quality, and 4) fluency. To achieve good pronunciation quality, learners need to minimize nativity influences in each of the four sub-qualities, which can be achieved with effective spoken English tutoring methods. However, these methods are expensive because they require highly proficient English experts. Where a cost-effective solution is required, it is useful to have a tutoring system that assesses a learner's pronunciation and provides feedback in each of the four sub-qualities to minimize nativity influences, in a manner similar to a human expert. Such systems are also useful for learners who cannot access high-quality tutoring due to demographic and physical constraints. In this thesis, several methods are developed to assess pronunciation quality and provide feedback for such a spoken English tutoring system for Indian learners. Most existing work on automatic pronunciation assessment predicts an overall pronunciation quality, whereas feedback prediction has typically been done separately in each of the four sub-qualities. Both pronunciation assessment and feedback prediction require annotations on a large set of recordings from learners. While the former requires ratings of overall pronunciation quality, the latter needs feedback-specific labels. Unlike ratings, labels for feedback prediction require highly skilled annotators. Such annotators are not available in large numbers, and their labeling is costly.
Due to this paucity of labels, it is challenging to design a cost-effective tutoring system, particularly for Indian nativity, which is known for its large accent variability. In view of these challenges, the key contributions of this thesis are: 1) building models that estimate parameters for providing meaningful feedback without using any labelled data, 2) building models that estimate overall pronunciation quality using annotated data, and 3) developing voɪsTUTOR, a system with which learners train themselves in a neutral accent of English with the help of a spoken English expert. The feedback prediction is semi-supervised in nature, as no feedback-specific labels are used for building the feedback prediction models. Feedback in each of the four sub-qualities is predicted by analyzing mismatches in the respective parameters between a learner's and an expert's speech. In the phonemic category, phoneme errors made by a learner are provided as feedback, where the phonemes are estimated using a rule-based pronunciation dictionary. These rules are deduced from the errors made by Indian learners while speaking English. To demonstrate the correct pronunciation, an articulatory video is synthesized using an expert's speech. Further, the effect of accents on the uttered phonemes is assessed using the goodness of pronunciation measure, which is computed in a deep neural network-hidden Markov model (DNN-HMM) based automatic speech recognition (ASR) framework. In the stress category, mismatches in the estimated stressed syllable locations are provided as feedback. For this, stress-specific features are computed by exploring linguistic parameters, such as sonority, from every syllable when the ground-truth syllable information is available. The performance is also analysed when the syllable information is estimated, as in a real scenario. The stress locations are also estimated in an ASR framework without computing any stress-specific features.
In the intonation category, feedback is provided based on the local and global mismatches in pitch patterns. For this, models are proposed to estimate the pitch values and their associated confidence scores. It is observed that the global mismatches depend on temporal variations in the pitch and its patterns. These mismatches are identified better when the confidence scores are used along with the pitch values in models based on HMMs and long short-term memory (LSTM) networks. Both the global and local mismatches are identified using a knowledge-driven template matching approach that performs confidence-score-based median filtering and pitch stylization. In the fluency category, mismatches in the pause locations are provided as feedback. The pause locations are estimated using features based on speech acoustics only, without considering any canonical stress markings, because the learners' pronunciation often does not match the canonical pronunciation. Further, analysis is performed to estimate speech rate directly from the speech acoustics, as speech rate has been shown to be correlated with the fluency of a learner's pronunciation. The overall pronunciation rating is estimated using a joint model combining DNNs and LSTM networks. For this, studies are conducted to identify differences between the speech rhythm of Indian languages and that of English. Features based on speech rhythm are used for estimating the rating, along with features based on the parameters used for feedback in all four sub-qualities. Further, to create an interactive learning environment in voɪsTUTOR, the feedback and ratings are displayed using audio-visual aids, including line and bar graphs and text messages. All of these are made available in an Android app using a web server with a LAMP (Linux, Apache, MySQL, PHP) stack on an Ubuntu 14.04 LTS system. [en_US]
dc.language.iso: en_US [en_US]
dc.rights: I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. [en_US]
dc.subject: Speech Signal Processing [en_US]
dc.subject: Pronunciation Assessment [en_US]
dc.subject: Frame Selective Dynamic Programming (FSDP) [en_US]
dc.subject: Spoken English Tutoring [en_US]
dc.subject.classification: Electrical Engineering [en_US]
dc.title: Pronunciation assessment and semi-supervised feedback prediction for spoken English tutoring [en_US]
dc.type: Thesis [en_US]
dc.degree.name: PhD [en_US]
dc.degree.level: Doctoral [en_US]
dc.degree.grantor: Indian Institute of Science [en_US]
dc.degree.discipline: Engineering [en_US]

