Why only two ears? Some indicators from the study of source separation using two sensors
Abstract
In this thesis we develop algorithms for estimating broadband source signals from a mixture using only two sensors. This is motivated by what is known in the literature as the cocktail party effect: the ability of human beings to listen to a desired source in a mixture of sources using at most two ears. Such a study lets us (i) achieve a better understanding of the auditory pathway in the brain and confirm results from physiology and psychoacoustics; (ii) obtain clues for locating an equivalent structure in the brain corresponding to each modification that improves the algorithm; (iii) devise a benchmark system to automate the evaluation of systems such as 'surround sound'; and (iv) perform speech recognition in noisy environments. Moreover, what we learn about replicating the functional units of the brain may help in replacing those units with signal processing units for patients suffering from defects in them.
There are two parts to the thesis. In the first part we assume the source signals to be broadband with strong spectral overlap, and the channel to have a few strong multipaths. We propose an algorithm that estimates all the strong multipaths from each source to the sensors, for more than two sources, using measurements from only two sensors. Because the channel matrix is not invertible when the number of sources exceeds the number of sensors, we use the estimated multipath delays of each source to improve the signal-to-interference ratio (SIR) of the sources. In the second part we consider the specific scenario of colored signals and a channel with a prominent direct path. Speech signals as sources in a weakly reverberant room, with a pair of microphones as sensors, satisfy these conditions. We consider the cases with and without a head-like structure between the microphones; the head-like structure we used was a cubical block of wood. We propose an algorithm for separating sources in this scenario. We identify the features of speech and of the channel that make it possible for the human auditory system to solve the cocktail party problem; these are the same properties that our model satisfies. The algorithm works well in a partly acoustically treated room (with three persons speaking, two microphones, and data acquired using a standard PC setup) and not so well in a heavily reverberant scenario.
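As a rough illustration of the delay cue that two-sensor methods of this kind rely on, the sketch below picks the strongest inter-sensor lags from a plain cross-correlation. It is not the algorithm of the thesis: it assumes a single white-noise source, integer-sample delays, and a noise-free channel. The names `strongest_lags` and `delayed` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def strongest_lags(x1, x2, max_lag, n_paths=1):
    """Return the n_paths lags (in samples) with the largest |cross-correlation|.

    A positive lag means x2 contains a delayed copy of x1.
    Assumes x1 and x2 have equal length and max_lag < len(x1).
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # corr[k + len(x1) - 1] = sum_n x2[n + k] * x1[n], so zero lag
    # sits at index len(x1) - 1 of the "full" output.
    corr = np.correlate(x2, x1, mode="full")
    zero = len(x1) - 1
    lags = np.arange(-max_lag, max_lag + 1)
    window = np.abs(corr[zero - max_lag : zero + max_lag + 1])
    # Indices of the n_paths largest correlation magnitudes.
    best = np.argsort(window)[::-1][:n_paths]
    return sorted(int(lags[i]) for i in best)

def delayed(s, d, gain=1.0):
    """Hypothetical helper: s delayed by d >= 0 samples, zero-padded at the start."""
    return gain * np.concatenate([np.zeros(d), s[: len(s) - d]])

# Toy two-path channel (assumed values): a direct path delayed by 7 samples
# and a weaker echo delayed by 15 samples, both relative to sensor 1.
rng = np.random.default_rng(0)
s = rng.standard_normal(4000)
x1 = s
x2 = delayed(s, 7) + 0.5 * delayed(s, 15)
print(strongest_lags(x1, x2, max_lag=20, n_paths=2))
```

With several spectrally overlapping sources the correlation peaks of different sources interleave, so a plain peak picker like this one is only a starting point for the multipath estimation described above.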
We see that there are similarities between the processing steps of the algorithm and what is known of how our auditory system works, especially in the regions of the auditory pathway before the auditory cortex. Based on the above experiments we give reasons to support the hypothesis about why all known organisms need only two ears and not more, but may have more than two eyes to their advantage. Our results also indicate that part of the pitch estimation for individual sources might occur in the brain after the individual source components have been separated. This might resolve the dilemma of having to perform multi-pitch estimation. Recent work suggests that there are parallel pathways in the brain, up to the primary auditory cortex, that deal with temporal-cue-based and spatial-cue-based processing. Our model seems to mimic the pathway that makes use of the spatial cues.