Audio-Visual Association Learning in Humans and Multimodal Networks
Abstract
We learn audiovisual associations effortlessly, for example when we give visual objects their
names. While humans readily learn the names of new objects while retaining previously learned
information, deep neural networks forget old associations when trained on new ones, a
phenomenon called catastrophic forgetting. In this thesis, I performed two studies to
characterize human and deep network performance on learning novel audiovisual associations.
In Study 1, we compared the performance of humans and a multimodal deep network on learning
novel object-word associations, and measured the decay in performance on initially encountered
pairs after more pairs were learned. We selected 60 object-word pairs from the Novel Object and
Unusual Name (NOUN) dataset and performed equivalent experiments on humans and deep networks. In
the human experiments, participants completed 6 sessions of learning and testing novel
object-word associations. In each session, they were asked to memorize 10 novel object-word
pairs and were then tested: they heard the spoken word (in a different voice/accent) and had to
identify the associated image (in a different color/orientation) among all the object images
encountered in the session. This test was performed immediately for some objects and after a new
learning session for others. Human accuracy was 59% on the immediate test and decreased only
slightly at longer intervals, although accuracy on the delayed test was significantly lower than
on the immediate test.
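A minimal sketch of this session schedule is shown below, assuming Python; the even split of
each session's pairs between the immediate and delayed tests is an assumption made for
illustration, not necessarily the split used in the experiments.

```python
import random

# Illustrative Study 1 schedule: 60 pairs split into 6 sessions of 10.
# The even immediate/delayed split within a session is an assumption
# for illustration, not the split used in the experiments.
def make_schedule(pairs, n_sessions=6, per_session=10, seed=0):
    rng = random.Random(seed)
    rng.shuffle(pairs)
    schedule = []
    for s in range(n_sessions):
        session_pairs = pairs[s * per_session:(s + 1) * per_session]
        schedule.append({
            "learn": session_pairs,                        # pairs memorized this session
            "test_now": session_pairs[:per_session // 2],  # tested immediately
            "test_later": session_pairs[per_session // 2:] # tested after the next session
        })
    return schedule

schedule = make_schedule([f"pair_{i:02d}" for i in range(60)])
```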
In the deep network experiment, we used an audiovisual network with an image subnetwork and an
audio subnetwork, trained with a triplet loss to learn object-word associations in the same way
as in the human experiments. In each session, the network was trained on 10 object-word pairs
and tested by presenting an audio word and finding the nearest matching image among the full
set. We evaluated two scenarios: a vanilla network with no constraint on its weights, and a
network trained with elastic weight consolidation (EWC). In the vanilla setting, performance on
initially encountered pairs decreased after the network was trained on new pairs, but this
forgetting was reduced by EWC. We matched current-session accuracies with human performance to
compare forgetting, and found that the vanilla network forgot more than humans, whereas the EWC
network retained old pairs better than humans did.
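A minimal PyTorch sketch of this training and test procedure is given below. The encoder
architectures, embedding size, margin, and EWC strength are illustrative assumptions rather than
the settings used in the thesis, and the Fisher information passed to the EWC penalty is assumed
to have been estimated on earlier sessions in the usual way.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualNet(nn.Module):
    """Two-subnetwork model mapping image and audio features to a shared space."""
    def __init__(self, img_dim=2048, aud_dim=1024, emb_dim=128):
        super().__init__()
        self.image_net = nn.Sequential(nn.Linear(img_dim, 512), nn.ReLU(),
                                       nn.Linear(512, emb_dim))
        self.audio_net = nn.Sequential(nn.Linear(aud_dim, 512), nn.ReLU(),
                                       nn.Linear(512, emb_dim))

    def forward(self, img_feat, aud_feat):
        # Embed both modalities and L2-normalize so distances are comparable.
        return (F.normalize(self.image_net(img_feat), dim=-1),
                F.normalize(self.audio_net(aud_feat), dim=-1))

def triplet_loss(anchor_aud, pos_img, neg_img, margin=0.2):
    # Pull the spoken word toward its paired image and push it away from
    # a mismatched image by at least `margin`.
    d_pos = (anchor_aud - pos_img).pow(2).sum(-1)
    d_neg = (anchor_aud - neg_img).pow(2).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()

def ewc_penalty(model, fisher, old_params, lam=100.0):
    # EWC: quadratic penalty (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2
    # anchoring weights that were important for earlier sessions.
    loss = 0.0
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - old_params[name]).pow(2)).sum()
    return 0.5 * lam * loss

def nearest_image(model, aud_feat, all_img_feats):
    # Test trial: embed the spoken word and return the index of the closest
    # image embedding among all object images encountered so far.
    with torch.no_grad():
        img_emb, aud_emb = model(all_img_feats, aud_feat.unsqueeze(0))
        return torch.cdist(aud_emb, img_emb).argmin(dim=-1).item()
```

In the EWC scenario, the per-session objective would be the triplet loss plus `ewc_penalty`, so
that weights important for earlier pairs are anchored while new pairs are learned; the vanilla
scenario drops the penalty term.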
In Study 2, we investigated whether there is an order preference between the image and audio
modalities when humans learn audiovisual associations, i.e., is learning better when the image
is presented before the audio than when the audio comes first? We distributed pairs across
learning conditions that varied in the order of image and audio presentation and in the delay
between the end of one modality and the start of the other, with delay values of 0 ms, 500 ms,
and 1000 ms (sketched below). Pairs were shown to subjects under these learning conditions, and
subjects were then tested on a cross-modal matching task with both image and audio as the
question modality in two separate tests. If there is indeed an optimal learning condition, it
should be reflected in better test performance for that condition. We found no significant
difference in performance due to the order or delay of encountering the image and audio in
either test.
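For concreteness, treating the learning conditions as a full crossing of presentation order and
inter-modality delay gives the sketch below; the full 2 x 3 crossing is an assumption for
illustration, since the abstract does not specify which cells of the design were used.

```python
from itertools import product

# Illustrative 2 x 3 design: presentation order crossed with the delay
# between the end of the first modality and the onset of the second.
orders = [("image", "audio"), ("audio", "image")]
delays_ms = [0, 500, 1000]

conditions = [{"first": a, "second": b, "delay_ms": d}
              for (a, b), d in product(orders, delays_ms)]

for c in conditions:
    print(f"{c['first']} -> {c['delay_ms']} ms gap -> {c['second']}")
```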