
dc.contributor.advisor: Venkatesh Babu, R
dc.contributor.author: Mopuri, Konda Reddy
dc.date.accessioned: 2021-10-20T10:24:05Z
dc.date.available: 2021-10-20T10:24:05Z
dc.date.submitted: 2018
dc.identifier.uri: https://etd.iisc.ac.in/handle/2005/5446
dc.description.abstract:
Deep neural networks have resulted in unprecedented performance on various learning tasks. In particular, Convolutional Neural Networks (CNNs) are shown to learn representations that can efficiently discriminate hundreds of visual categories. They learn a hierarchy of representations, ranging from low-level edge and blob detectors to semantic features such as object categories. These representations can be employed as off-the-shelf visual features in various vision tasks such as image classification, scene retrieval and caption generation. In this thesis, we investigate three important aspects of the representations learned by CNNs: (i) Augmentation: incorporating useful side and additional information to augment the learned visual representations, (ii) Visualization: providing visual explanations for the predicted inference, and (iii) Robustness: their susceptibility to adversarial perturbations at test time.

Augmenting: In the first part of this thesis, we present approaches that exploit useful side and additional information to enrich the learned representations with more semantics. Specifically, we learn to encode additional discriminative information from (i) an objectness prior over the image regions, and (ii) the strong supervision offered by captions that human subjects provide to describe the image contents.

Objectness prior: In order to encode comprehensive visual information from a scene, existing methods typically employ deep-learned visual representations in a sliding-window framework. This exhaustive approach is tedious and computationally demanding. Scenes, on the other hand, are typically composed of objects; it is the objects that make a scene what it is. We therefore exploit objectness information while aggregating the visual features from individual image regions into a compact image representation (see the first sketch following this abstract). Restricting the description to object-like regions drastically reduces the number of image patches to be considered and automatically takes care of scale. Owing to the robust object representations learned by CNNs, our aggregated image representations exhibit improved invariance to general image transformations such as translation, rotation and scaling. The proposed representation can discriminate images even under extreme dimensionality reduction, including binarization.

Strong supervision: In a typical supervised learning setting for object recognition, labels offer only weak supervision: all that a label provides is the presence or absence of an object in an image, neglecting a lot of useful information about the actual object, such as attributes and context. Image captions, on the other hand, provide rich information about image contents. Therefore, in order to enhance the representations, we exploit image captions as strong supervision for the application of object retrieval. We show that strong supervision, when coupled with pairwise constraints, helps the representations better learn the graded (non-binary) relevances between pairs of images.

Visualization: Despite their impressive performance, CNNs offer limited transparency and are therefore treated as black boxes. Increasing depth, intricate architectures and sophisticated regularizers make them complex machine learning models. One way to make them transparent is to provide visual explanations for their predictions, i.e., to visualize the image regions that guide their predictions, thereby making them explainable.

In the second part of the thesis, we develop a novel visualization method to locate the evidence in the input for a given activation at any layer in the architecture. Unlike most existing methods that rely on gradient computation, we directly exploit the dependencies across the learned representations to make CNNs more interactive. Our method enables various applications in a weakly supervised setup, such as visualizing the evidence for a given activation (e.g. the predicted label), grounding a predicted caption, and object detection.

Robustness: Along with their successful adoption across various vision tasks, the learned representations are also observed to be unstable to the addition of special noise of small magnitude, called adversarial perturbations. The third and final part of the thesis therefore focuses on the stability of the representations to these additive perturbations.

Generalizable data-free objectives: These additive perturbations cause CNNs to produce inaccurate predictions with high confidence and threaten their deployability in the real world. In order to craft these perturbations (either image-specific or image-agnostic), existing methods solve complex fooling objectives that require samples from the target data distribution. Moreover, the existing methods to craft image-agnostic perturbations are task-specific, i.e., their objectives are designed to suit the underlying task. For the first time, we introduce generalizable and data-free objectives to craft image-agnostic adversarial perturbations (see the second sketch following this abstract). Our objective generalizes across multiple vision tasks such as object recognition, semantic segmentation and depth estimation, and efficiently crafts perturbations that effectively fool the target models. It also exposes the fragility of the learned representations in the black-box attacking scenario, where no information about the target model is known. In spite of being data-free, our objectives can exploit minimal prior information about the training distribution, such as the dynamic range of the images, in order to craft stronger attacks.

Modeling the adversaries: Most existing methods present optimization approaches to craft adversarial perturbations. Moreover, for a given classifier, they generate one perturbation at a time, which is a single instance from a possibly large manifold of adversarial perturbations. In order to build robust models, however, it is essential to explore this manifold. We propose, for the first time, a generative approach to model the distribution of such perturbations in both “data-dependent” and “data-free” scenarios (see the third sketch following this abstract). Our generative model is inspired by Generative Adversarial Networks (GANs) and is trained using fooling and diversity objectives. The proposed generator network captures the distribution of adversarial perturbations for a given classifier and readily generates a wide variety of such perturbations. We demonstrate that the perturbations crafted by our model (i) achieve state-of-the-art fooling rates, (ii) exhibit wide variety, and (iii) deliver excellent cross-model generalizability. Our work can be deemed an important step towards understanding the complex manifolds of adversarial perturbations. This knowledge of the adversaries can be exploited to learn better representations that are robust to various attacks.
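First sketch. A minimal, hypothetical illustration of the objectness-guided aggregation described in the abstract: a CNN descriptor is computed per object-like proposal and the per-region descriptors are max-pooled into one compact image representation. The backbone, input size, pooling choice and the function names (`region_descriptor`, `describe_image`) are illustrative assumptions, not the thesis implementation; proposals are assumed to come from any off-the-shelf objectness method.

```python
import torch
import torchvision

# Pretrained CNN used only as a fixed feature extractor (illustrative choice).
backbone = torchvision.models.alexnet(weights="DEFAULT").eval()

def region_descriptor(crop):
    # Resize the region to the network's input size and take a deep feature.
    crop = torch.nn.functional.interpolate(crop[None], size=(224, 224), mode="bilinear")
    with torch.no_grad():
        fmap = backbone.features(crop)           # last conv feature map
    return fmap.mean(dim=(2, 3)).squeeze(0)      # global-average-pooled descriptor

def describe_image(image, proposals):
    """image: 3xHxW tensor in [0, 1]; proposals: (x1, y1, x2, y2) object-like boxes."""
    descs = [region_descriptor(image[:, y1:y2, x1:x2]) for (x1, y1, x2, y2) in proposals]
    # Max-pool across object-like regions into one compact image representation.
    return torch.stack(descs).max(dim=0).values
```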
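Second sketch. A hedged illustration of a data-free, image-agnostic fooling objective in the spirit the abstract describes: starting from a random perturbation bounded in the l-infinity norm, the optimization maximizes the activations that the perturbation alone produces at a few layers, without using any training data. The target network, layer choice, budget `xi`, learning rate and iteration count are assumptions for illustration only, not the exact thesis objective.

```python
import torch
import torchvision

# Target classifier whose representations the perturbation should saturate.
model = torchvision.models.vgg16(weights="DEFAULT").eval()
layers = [model.features[i] for i in (4, 9, 16, 23, 30)]   # a few intermediate outputs

acts = []
hooks = [l.register_forward_hook(lambda m, i, o: acts.append(o)) for l in layers]

xi = 10.0 / 255.0                                  # perturbation budget (assumed)
delta = (torch.rand(1, 3, 224, 224) * 2 - 1) * xi  # random init within the budget
delta.requires_grad_(True)
opt = torch.optim.Adam([delta], lr=0.01)

for _ in range(200):
    acts.clear()
    model(delta)                                   # data-free: only the perturbation is fed
    # Maximize activations at the hooked layers (minimize their negative log norms).
    loss = -sum(torch.log(a.norm() + 1e-8) for a in acts)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-xi, xi)                      # project back into the budget

for h in hooks:
    h.remove()
```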
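Third sketch. A hedged illustration of how a generator might be trained to model a distribution of perturbations with a fooling term (degrade the classifier's benign prediction) and a diversity term (different latents should yield different perturbations), as outlined in the abstract. The generator architecture, loss weights and the pairing used for the diversity term are illustrative assumptions; the actual objectives in the thesis may differ.

```python
import torch
import torch.nn.functional as F

class Generator(torch.nn.Module):
    """Maps a latent code z to an image-sized perturbation bounded by xi."""
    def __init__(self, z_dim=64, xi=10.0 / 255.0):
        super().__init__()
        self.xi = xi
        self.net = torch.nn.Sequential(
            torch.nn.Linear(z_dim, 512), torch.nn.ReLU(),
            torch.nn.Linear(512, 3 * 224 * 224), torch.nn.Tanh())

    def forward(self, z):
        return self.xi * self.net(z).view(-1, 3, 224, 224)

def training_step(generator, classifier, images, z):
    delta = generator(z)                           # a batch of candidate perturbations
    clean = classifier(images)
    adv = classifier(images + delta)
    benign_label = clean.argmax(dim=1)
    # Fooling: push down the probability of the originally predicted class.
    fooling = F.softmax(adv, dim=1).gather(1, benign_label[:, None]).mean()
    # Diversity: perturbations from different latents should not collapse together
    # (here measured between each perturbation and another one in the batch).
    diversity = -F.pairwise_distance(delta.flatten(1), delta.flip(0).flatten(1)).mean()
    return fooling + 0.1 * diversity               # loss minimized w.r.t. generator weights
```

In a full training loop, `training_step` would be called on batches of images and random latents, and its output back-propagated into the generator's parameters so that sampling different z values yields a variety of effective perturbations.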
dc.language.iso: en_US
dc.relation.ispartofseries: ;G29282
dc.rights: I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
dc.subject: Generative Adversarial Networks
dc.subject: Convolutional Neural Networks
dc.subject: Augmentation
dc.subject: Visualization
dc.subject: Image
dc.subject.classification: Research Subject Categories::TECHNOLOGY::Information technology::Computer science
dc.title: Deep Visual Representations: A study on Augmentation, Visualization, and Robustness
dc.type: Thesis
dc.degree.name: PhD
dc.degree.level: Doctoral
dc.degree.grantor: Indian Institute of Science
dc.degree.discipline: Engineering

