Show simple item record

dc.contributor.advisor: Sreenivas, T V
dc.contributor.author: Sundar, Harshavardhan
dc.date.accessioned: 2016-09-15T11:46:20Z
dc.date.accessioned: 2018-07-31T04:48:46Z
dc.date.available: 2016-09-15T11:46:20Z
dc.date.available: 2018-07-31T04:48:46Z
dc.date.issued: 2016-09-15
dc.date.submitted: 2016
dc.identifier.uri: https://etd.iisc.ac.in/handle/2005/2569
dc.identifier.abstract: http://etd.iisc.ac.in/static/etd/abstracts/3305/Harsha-Abs.pdf
dc.description.abstract: Speech is by far the most natural form of communication between human beings. It is intuitive, expressive, and contains information at several cognitive levels. We, as humans, are perceptive to several of these levels: we can gather information pertaining to the identity of the speaker, the speaker's gender, emotion, location, and language, in addition to the content of what is being spoken. This makes speech-based human-machine interaction (HMI) both desirable and challenging, for the same set of reasons. For HMI to be natural for humans, it is imperative that a machine understands the information present in speech, at least at the level of speaker identity, language, location in space, and the summary of what is being spoken. Although one can draw parallels between human-human interaction and HMI, the two differ in their purpose. We interact with a machine mostly to get a task done more efficiently than is possible without it. Thus, in HMI, controlling the machine in a specific manner is typically the primary goal. In this context, it can be argued that HMI with a limited vocabulary of specific commands would suffice for more efficient use of the machine.

In this thesis, we address the problem of ``Who spoke what and where?" in the context of a machine understanding the information pertaining to the identities of the speakers, their locations in space, and the keywords they spoke, thus considering three levels of information: speaker identity (who), location (where), and keywords (what). This could be addressed with the help of multiple sensors such as microphones, video cameras, proximity sensors, and motion detectors, combining all these modalities; however, we explore the use of microphones alone. In practical scenarios, multiple people often talk at the same time. Thus, the goal of this thesis is to detect all the speakers, their keywords, and their locations in mixture signals containing speech from simultaneous speakers. Addressing the problem of ``Who spoke what and where?" using only microphone signals forms a part of acoustic scene analysis (ASA) of speech-based acoustic events.

We divide the problem into two sub-problems: ``Who spoke what?" and ``Who spoke where?". Each of these is cast in a generic latent variable (LV) framework to capture information in speech at different levels. We associate an LV with each of these levels and model the relationship between the levels using conditional dependencies. The sub-problem of ``Who spoke what?" is addressed using a single-channel microphone signal, by modeling the mixture signal in terms of the LV mass function of speaker identity, the conditional mass function of the keyword spoken given the speaker identity, and speaker-specific keyword models. The LV mass functions are estimated in a maximum likelihood (ML) framework using the Expectation-Maximization (EM) algorithm, with Student's-t mixture models (tMMs) as the speaker-specific keyword models. Motivated by HMI in a home environment, we have created our own database. On mixture signals containing two speakers uttering keywords simultaneously, the proposed framework achieves an accuracy of 82% in detecting both the speakers and their respective keywords.

The other sub-problem, ``Who spoke where?", is addressed in two stages. In the first stage, the enclosure is discretized into sectors. The speakers and the sectors in which they are located are detected, using signals collected from a Uniform Circular Array (UCA), by an approach similar to the one employed for ``Who spoke what?"; however, in place of speaker-specific keyword models, we use tMM-based speaker models trained on clean speech, along with a simple Delay and Sum Beamformer (DSB). In the second stage, the speakers are localized within the active sectors using a novel region-constrained localization technique based on the time difference of arrival (TDOA). Since the problem being addressed is a multi-label classification task, we use the average Hamming score (accuracy) as the performance metric. Although the proposed approach yields an accuracy of 100% in an anechoic setting for detecting both the speakers and their corresponding sectors in two-speaker mixture signals, the performance degrades to 67% in a reverberant setting with a 60 dB reverberation time (RT60) of 300 ms. To improve the performance under reverberation, prior knowledge of the locations of the multiple sources is obtained using a novel technique based on geometrical insights into TDOA estimation. With this prior knowledge, the accuracy of the proposed approach improves to 91%. It is worth noting that these accuracies are computed for mixture signals containing more than 90% overlap between the competing speakers.

The proposed LV framework offers a convenient methodology for representing information at broad levels. In this thesis, we have shown its use with three such levels; it can be extended to further levels for a generic analysis of an acoustic scene consisting of broad classes of events. Not all levels are dependent on each other, and hence the LV dependencies can be reduced through independence assumptions, leading to several smaller sub-problems, as shown above. The LV framework is also attractive for incorporating prior knowledge about the acoustic setting, which is combined with the evidence from the data to infer the presence of an acoustic event. The performance of the framework depends on the choice of stochastic models for the likelihood function of the data given the presence of acoustic events; however, the framework readily allows different stochastic models for this likelihood function to be compared and contrasted.
dc.language.iso: en_US
dc.subject: Signal Processing
dc.subject: Acoustic Scene Analysis (ASA)
dc.subject: Latent Variables
dc.subject: Expectation Maximization
dc.subject: EM Algorithm
dc.subject: Mixture Models
dc.subject: Acoustic Source Localization
dc.subject: Hyperboloids
dc.subject: Student's-t Mixture Models
dc.subject: Gaussian Mixture Models
dc.subject: Robust Mixture Modeling
dc.subject: Time Difference of Arrival (TDOA)
dc.subject.classification: Communication Engineering
dc.title: Who Spoke What And Where? A Latent Variable Framework For Acoustic Scene Analysis
dc.type: Thesis
dc.degree.name: PhD
dc.degree.level: Doctoral
dc.degree.discipline: Faculty of Engineering
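The abstract describes estimating the latent-variable mass functions, namely the speaker prior and the conditional mass function of the keyword given the speaker, by maximum likelihood with the EM algorithm, given speaker-specific keyword models. The following is a minimal sketch of that estimation step, assuming the per-frame likelihoods p(x_t | speaker, keyword) have already been evaluated with pretrained Student's-t mixture models; the function name, array shapes, and decision rule are illustrative and not taken from the thesis.

```python
import numpy as np

def estimate_lv_mass_functions(frame_lik, n_iter=50, eps=1e-12):
    """EM estimate of P(speaker) and P(keyword | speaker).

    frame_lik : array of shape (T, S, K) holding p(x_t | speaker s, keyword k),
                e.g. evaluated with pretrained speaker-specific keyword tMMs.
    Returns (p_spk, p_kw) with shapes (S,) and (S, K).
    """
    T, S, K = frame_lik.shape
    p_spk = np.full(S, 1.0 / S)        # P(speaker), initialized uniform
    p_kw = np.full((S, K), 1.0 / K)    # P(keyword | speaker), initialized uniform

    for _ in range(n_iter):
        # E-step: posterior responsibility of each (speaker, keyword) pair per frame
        joint = p_spk[None, :, None] * p_kw[None, :, :] * frame_lik      # (T, S, K)
        gamma = joint / (joint.sum(axis=(1, 2), keepdims=True) + eps)

        # M-step: re-estimate the latent-variable mass functions from expected counts
        counts = gamma.sum(axis=0)                                       # (S, K)
        p_spk = counts.sum(axis=1) / (counts.sum() + eps)
        p_kw = counts / (counts.sum(axis=1, keepdims=True) + eps)

    return p_spk, p_kw
```

Active speakers and their keywords in the mixture could then be read off by thresholding the estimated mass functions; the decision rule actually used in the thesis may differ.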
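The second-stage localization is based on time differences of arrival across the microphone array. The thesis' region-constrained technique and its geometrical priors are not reproduced here; the sketch below only shows a standard GCC-PHAT estimate of the TDOA for a single microphone pair, the kind of measurement that TDOA-based localization schemes typically build on. The interface is hypothetical.

```python
import numpy as np

def gcc_phat_tdoa(x, y, fs, max_tau=None):
    """Estimate the TDOA between two microphone signals using GCC-PHAT.

    x, y : 1-D arrays of equal length, one per microphone, sampled at fs (Hz).
    Returns the delay of y relative to x in seconds (positive if y lags x).
    """
    n = 2 * len(x)                      # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = np.conj(X) * Y
    R /= np.abs(R) + 1e-12              # PHAT weighting: keep only the phase
    cc = np.fft.irfft(R, n=n)

    max_shift = n // 2
    if max_tau is not None:             # optionally limit the search to physical delays
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs
```

Each pairwise TDOA constrains the source to a hyperboloid (one of the subject keywords of this record); restricting the intersection of such constraints to the sector detected in the first stage is, broadly, what a region-constrained localization does.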
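Since detecting multiple speakers together with their keywords or sectors is a multi-label task, performance is reported as the average Hamming score. A common definition of this score, assumed here to match the thesis' usage, is the per-example intersection-over-union of the predicted and true label sets, averaged over the test set:

```python
def average_hamming_score(true_sets, pred_sets):
    """Average Hamming score (multi-label accuracy).

    true_sets, pred_sets : lists of sets of labels, e.g. (speaker, sector) pairs.
    For each test case, score = |true & pred| / |true | pred|, averaged over cases.
    """
    scores = []
    for t, p in zip(true_sets, pred_sets):
        union = t | p
        scores.append(len(t & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

# Example: two-speaker mixture where one (speaker, sector) pair is missed
truth = [{("spk1", "sector3"), ("spk2", "sector7")}]
pred = [{("spk1", "sector3")}]
print(average_hamming_score(truth, pred))  # 0.5
```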

