Show simple item record

dc.contributor.advisorGanapathy, Sriram
dc.contributor.advisorArun, S P
dc.contributor.authorHarjpal, Chandrakant
dc.date.accessioned2024-06-13T09:38:59Z
dc.date.available2024-06-13T09:38:59Z
dc.date.submitted2023
dc.identifier.urihttps://etd.iisc.ac.in/handle/2005/6527
dc.description.abstractWe easily learn audiovisual associations when we give visual objects their names. While humans readily learn the names of new objects while retaining previously learned information, deep neural networks forget old associations when trained on new ones, a phenomenon called catastrophic forgetting. In this thesis, I performed two studies to characterize human and deep network performance on learning novel audiovisual associations. In Study 1, we compared the performance of humans and a multimodal deep network on learning novel object-word associations, and the decay in performance on initially encountered pairs after more pairs are learned. We selected 60 object-word pairs from the Novel Object and Unusual Name (NOUN) dataset and performed equivalent experiments on both humans and deep networks. In the human experiments, participants performed 6 sessions of learning and testing novel object-word associations. In each session, they were asked to memorize 10 novel object-word pairs and were then tested by presenting the spoken word (in a different voice/accent) and asking them to identify the associated image (in a different color/orientation) among all the object images encountered in the session. This test was performed for some objects immediately and for others after a further learning session. Human accuracy was 59% on the immediate test and decreased only slightly when tested after longer intervals, although accuracy on the delayed test was significantly lower than on the immediate test. In the deep network experiment, we used an audiovisual network with an image subnetwork and an audio subnetwork, trained with a triplet loss to learn object-word associations in the same way as in the human experiments. In each session, the network was trained on 10 object-word pairs and tested by presenting an audio word and finding the nearest matching image among the full set.
We evaluated two scenarios: a vanilla network with no constraint on its weights, and a network with elastic weight consolidation (EWC). In the vanilla setting, performance on initially encountered pairs decreased after the network was trained on new pairs, but this forgetting was reduced with EWC. We matched current-session accuracies to human performance to compare forgetting, and found that the vanilla network forgot more than humans, whereas the EWC network forgot less. In Study 2, we investigated whether humans show an order preference between the image and audio modalities when learning audiovisual associations, i.e., is presenting the image before the audio better than presenting the audio first? We distributed pairs across learning conditions that varied in either the order of image and audio presentation or the delay between the end of one modality and the start of the other (0 ms, 500 ms, or 1000 ms). The pairs were shown to subjects under the different learning conditions, and subjects were then tested on a cross-modal matching task with both image and audio as the question modality in two separate tests. If there is an optimal learning condition, it should be reflected in better test performance for that condition. In both tests, we found no significant difference in performance due to the order of, or delay between, the image and audio presentations.en_US
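To make the two techniques named in the abstract concrete, here is a minimal NumPy sketch of a triplet loss (pulling a matched audio-image embedding pair together, pushing a mismatched pair apart) and the EWC quadratic penalty. The function names, margin, and shapes are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss on embeddings: the matched (anchor, positive) pair,
    e.g. an audio word and its associated image, should be closer than
    the mismatched (anchor, negative) pair by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to match
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to non-match
    return max(d_pos - d_neg + margin, 0.0)

def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Elastic weight consolidation: a quadratic penalty that anchors
    parameters important for previously learned pairs (high Fisher
    information) near the values learned in earlier sessions."""
    return 0.5 * lam * np.sum(fisher * (params - old_params) ** 2)
```

In training, the total loss for a new session would be the triplet loss on the new pairs plus `ewc_penalty` over the network weights, so that weights critical to old associations resist change while unimportant weights remain free to learn.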
dc.language.isoen_USen_US
dc.relation.ispartofseries;ET00536
dc.rightsI grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.en_US
dc.subjectAudiovisual associationen_US
dc.subjectdeep networksen_US
dc.subjectcatastrophic forgettingen_US
dc.subjectClass incremental learningen_US
dc.subject.classificationResearch Subject Categories::TECHNOLOGY::Electrical engineering, electronics and photonics::Electrical engineeringen_US
dc.titleAudio-Visual association learning in Humans and Multimodal Networksen_US
dc.typeThesisen_US
dc.degree.nameMTech (Res)en_US
dc.degree.levelMastersen_US
dc.degree.grantorIndian Institute of Scienceen_US
dc.degree.disciplineEngineeringen_US

