Audio-Visual association learning in Humans and Multimodal Networks

Harjpal, Chandrakant

dc.contributor.advisor	Ganapathy, Sriram
dc.contributor.advisor	Arun, S P
dc.contributor.author	Harjpal, Chandrakant
dc.date.accessioned	2024-06-13T09:38:59Z
dc.date.available	2024-06-13T09:38:59Z
dc.date.submitted	2023
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/6527
dc.description.abstract	We easily learn audiovisual associations when we give visual objects their names. While humans easily learn the names of new objects while retaining previously learned information, deep neural networks forget old associations when trained on new ones, a phenomenon called catastrophic forgetting. In this thesis, I have performed two studies to characterize human and deep network performance on learning novel audiovisual associations. In Study 1, we compared humans and a multimodal deep network on learning novel object-word associations and on the decay in performance on initially encountered pairs after more pairs were learned. We selected 60 object-word pairs from the Novel Object and Unusual Name (NOUN) dataset and performed equivalent experiments on both humans and deep networks. In the human experiments, participants performed 6 sessions of learning and testing novel object-word associations. In each session, they memorized 10 novel object-word pairs and were then tested: they heard the spoken word (in a different voice/accent) and were asked to identify the associated image (in a different color/orientation) among all the object images encountered in the session. This test was performed for some objects immediately and for others after a new learning session. Human accuracy was 59% on the immediate test and decreased only slightly at longer intervals, although accuracy on the delayed test was significantly lower than on the immediate test. In the deep network experiment, we used an audiovisual network with image and audio subnetworks and trained it with a triplet loss to learn object-word associations in the same way as in the human experiments. In each session, the network was trained on 10 object-word pairs and tested by presenting an audio word and finding the nearest matching image among the full set.
We evaluated two scenarios: a vanilla network with no constraint on its weights, and a network with elastic weight consolidation (EWC). In the vanilla setting, performance on initially encountered pairs decreased after the network was trained on new pairs; EWC reduced this decrease. We matched the current-session accuracy of the networks to human performance in order to compare forgetting, and found that the vanilla network performed worse than humans, whereas the EWC network performed better than humans. In Study 2, we investigated whether humans show an order preference between the image and audio modalities when learning audiovisual associations, i.e., whether presenting the image before the audio leads to better learning than presenting the audio first. We assigned pairs to learning conditions that varied either in the order of image and audio presentation or in the delay between the end of the first modality and the start of the second; the delays were 0 ms, 500 ms and 1000 ms. Subjects learned the pairs under these conditions and were then tested on a cross-modal matching task in two separate tests, one with the image and one with the audio as the question modality. If an optimal learning condition exists, it should yield better test performance. We found no significant difference in performance due to the order or delay of image and audio presentation in either test.	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	;ET00536
dc.rights	I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.	en_US
dc.subject	Audiovisual association	en_US
dc.subject	deep networks	en_US
dc.subject	catastrophic forgetting	en_US
dc.subject	Class incremental learning	en_US
dc.subject.classification	Research Subject Categories::TECHNOLOGY::Electrical engineering, electronics and photonics::Electrical engineering	en_US
dc.title	Audio-Visual association learning in Humans and Multimodal Networks	en_US
dc.type	Thesis	en_US
dc.degree.name	MTech (Res)	en_US
dc.degree.level	Masters	en_US
dc.degree.grantor	Indian Institute of Science	en_US
dc.degree.discipline	Engineering	en_US
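
The abstract above describes training an audiovisual network with a triplet loss, testing it by finding the nearest matching image for a spoken word, and constraining the weights with elastic weight consolidation (EWC). The following is a minimal NumPy sketch of those three ingredients, not the thesis implementation: the function names, the margin value, the squared Euclidean distance, and the diagonal-Fisher form of the EWC penalty are all assumptions made for illustration.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on squared Euclidean distances.

    Each argument is an array of shape (batch, dim). The margin of 0.2
    is an illustrative assumption, not a value from the thesis.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())


def retrieval_accuracy(audio_emb, image_emb):
    """Fraction of audio queries whose nearest image is the paired one.

    Row i of audio_emb is assumed to be paired with row i of image_emb.
    """
    # Pairwise squared distances, shape (n_audio, n_image).
    diffs = audio_emb[:, None, :] - image_emb[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    nearest = np.argmin(dists, axis=1)
    return float(np.mean(nearest == np.arange(len(audio_emb))))


def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Quadratic EWC penalty: (lam / 2) * sum(F * (theta - theta_old)^2).

    `fisher` is a per-parameter importance estimate (diagonal Fisher
    information); how it is estimated is outside this sketch.
    """
    return 0.5 * lam * float(np.sum(fisher * (params - old_params) ** 2))
```

In a class-incremental setting like the one described, the penalty would be added to the triplet loss when training on a new session's pairs, so that parameters important for earlier pairs move less.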

Files in this item

Name:: Chandrakant_thesis_final.pdf
Size:: 1.552Mb
Format:: PDF
Description:: Thesis full text


This item appears in the following Collection(s)

Electrical Engineering (EE) [359]
