Audio-Visual Association Learning in Humans and Multimodal Networks
Abstract
We learn audiovisual associations effortlessly, for example when we give visual objects their
names. While humans readily learn the names of new objects while retaining previously learned
information, deep neural networks forget old associations when trained on new ones, a
phenomenon called catastrophic forgetting. In this thesis, I performed two studies to
characterize human and deep network performance on learning novel audiovisual associations.
In Study 1, we compared the performance of humans and a multimodal deep network on learning
novel object-word associations, and measured the decay in performance on initially encountered
pairs after more pairs were learned. We selected 60 object-word pairs from the Novel Object and
Unusual Name (NOUN) dataset and performed equivalent experiments on humans and deep networks. In
the human experiments, participants completed 6 sessions of learning and testing novel
object-word associations. In each session, they were asked to memorize 10 novel object-word
pairs and were then tested: they heard the spoken word (in a different voice/accent) and had to
identify the associated image (in a different color/orientation) among all the object images
encountered in the session. This test was performed immediately for some objects and after a new
learning session for others. Human accuracy was 59% on the immediate test and decreased only
slightly at longer intervals, although accuracy on the delayed test was significantly lower than
on the immediate test.
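A minimal sketch of this session schedule is shown below, assuming Python; the even split of
each session's pairs between the immediate and delayed tests is an assumption made for
illustration, not necessarily the split used in the experiments.

```python
import random

# Illustrative Study 1 schedule: 60 pairs split into 6 sessions of 10.
# The even immediate/delayed split within a session is an assumption
# for illustration, not the split used in the experiments.
def make_schedule(pairs, n_sessions=6, per_session=10, seed=0):
    rng = random.Random(seed)
    rng.shuffle(pairs)
    schedule = []
    for s in range(n_sessions):
        session_pairs = pairs[s * per_session:(s + 1) * per_session]
        schedule.append({
            "learn": session_pairs,                        # pairs memorized this session
            "test_now": session_pairs[:per_session // 2],  # tested immediately
            "test_later": session_pairs[per_session // 2:] # tested after the next session
        })
    return schedule

schedule = make_schedule([f"pair_{i:02d}" for i in range(60)])
```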
In the deep network experiment, we used an audiovisual network with an image subnetwork and an
audio subnetwork, trained with a triplet loss to learn object-word associations in the same way
as in the human experiments. In each session, the network was trained on 10 object-word pairs
and tested by presenting an audio word and finding the nearest matching image among the full
set. We evaluated two scenarios: a vanilla network with no constraint on its weights, and a
network trained with elastic weight consolidation (EWC). In the vanilla setting, performance on
initially encountered pairs decreased after the network was trained on new pairs, but this
forgetting was reduced by EWC. We matched current-session accuracies with human performance to
compare forgetting, and found that the vanilla network forgot more than humans, whereas the EWC
network retained old pairs better than humans did.
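A minimal PyTorch sketch of this training and test procedure is given below. The encoder
architectures, embedding size, margin, and EWC strength are illustrative assumptions rather than
the settings used in the thesis, and the Fisher information passed to the EWC penalty is assumed
to have been estimated on earlier sessions in the usual way.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualNet(nn.Module):
    """Two-subnetwork model mapping image and audio features to a shared space."""
    def __init__(self, img_dim=2048, aud_dim=1024, emb_dim=128):
        super().__init__()
        self.image_net = nn.Sequential(nn.Linear(img_dim, 512), nn.ReLU(),
                                       nn.Linear(512, emb_dim))
        self.audio_net = nn.Sequential(nn.Linear(aud_dim, 512), nn.ReLU(),
                                       nn.Linear(512, emb_dim))

    def forward(self, img_feat, aud_feat):
        # Embed both modalities and L2-normalize so distances are comparable.
        return (F.normalize(self.image_net(img_feat), dim=-1),
                F.normalize(self.audio_net(aud_feat), dim=-1))

def triplet_loss(anchor_aud, pos_img, neg_img, margin=0.2):
    # Pull the spoken word toward its paired image and push it away from
    # a mismatched image by at least `margin`.
    d_pos = (anchor_aud - pos_img).pow(2).sum(-1)
    d_neg = (anchor_aud - neg_img).pow(2).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()

def ewc_penalty(model, fisher, old_params, lam=100.0):
    # EWC: quadratic penalty (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2
    # anchoring weights that were important for earlier sessions.
    loss = 0.0
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - old_params[name]).pow(2)).sum()
    return 0.5 * lam * loss

def nearest_image(model, aud_feat, all_img_feats):
    # Test trial: embed the spoken word and return the index of the closest
    # image embedding among all object images encountered so far.
    with torch.no_grad():
        img_emb, aud_emb = model(all_img_feats, aud_feat.unsqueeze(0))
        return torch.cdist(aud_emb, img_emb).argmin(dim=-1).item()
```

In the EWC scenario, the per-session objective would be the triplet loss plus `ewc_penalty`, so
that weights important for earlier pairs are anchored while new pairs are learned; the vanilla
scenario drops the penalty term.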
In Study 2, we investigated whether there is an order preference between the image and audio
modalities when humans learn audiovisual associations, i.e., is learning better when the image
is presented before the audio than when the audio comes first? We distributed pairs across
learning conditions that varied in the order of image and audio presentation and in the delay
between the end of one modality and the start of the other, with delay values of 0 ms, 500 ms,
and 1000 ms (sketched below). Pairs were shown to subjects under these learning conditions, and
subjects were then tested on a cross-modal matching task with both image and audio as the
question modality in two separate tests. If there is indeed an optimal learning condition, it
should be reflected in better test performance for that condition. We found no significant
difference in performance due to the order or delay of encountering the image and audio in
either test.
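For concreteness, treating the learning conditions as a full crossing of presentation order and
inter-modality delay gives the sketch below; the full 2 x 3 crossing is an assumption for
illustration, since the abstract does not specify which cells of the design were used.

```python
from itertools import product

# Illustrative 2 x 3 design: presentation order crossed with the delay
# between the end of the first modality and the onset of the second.
orders = [("image", "audio"), ("audio", "image")]
delays_ms = [0, 500, 1000]

conditions = [{"first": a, "second": b, "delay_ms": d}
              for (a, b), d in product(orders, delays_ms)]

for c in conditions:
    print(f"{c['first']} -> {c['delay_ms']} ms gap -> {c['second']}")
```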