Deep Visual Representations: A study on Augmentation, Visualization, and Robustness
Abstract
Deep neural networks have resulted in unprecedented performance on various learning tasks. In particular, Convolutional Neural Networks (CNNs) have been shown to learn representations that can efficiently discriminate among hundreds of visual categories. They learn a hierarchy of representations ranging from low-level edge and blob detectors to semantic features such as object categories. These representations can be employed as off-the-shelf visual features in various vision tasks such as image classification, scene retrieval, caption generation, etc. In this thesis, we investigate three important aspects of the representations learned by CNNs: (i) Augmentation: incorporating useful side and additional information to augment the learned visual representations, (ii) Visualization: providing visual explanations for the predicted inference, and (iii) Robustness: their susceptibility to adversarial perturbations at test time.
Augmenting: In the first part of this thesis, we present approaches that exploit useful side and additional information to enrich the learned representations with more semantics. Specifically, we learn to encode additional discriminative information from (i) an objectness prior over the image regions, and (ii) the strong supervision offered by human-provided captions that describe the image contents.
Objectness prior: In order to encode comprehensive visual information from a scene, existing methods typically employ deep-learned visual representations in a sliding-window framework. This approach is exhaustive, tedious and computationally demanding. On the other hand, scenes are typically composed of objects, i.e., it is the objects that make a scene what it is. We exploit objectness information while aggregating the visual features from individual image regions into a compact image representation. Restricting the description to only object-like regions drastically reduces the number of image patches to be considered and automatically handles scale. Owing to the robust object representations learned by CNNs, our aggregated image representations exhibit improved invariance to common image transformations such as translation, rotation and scaling. The proposed representation can discriminate images even under extreme dimensionality reduction, including binarization.
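As an illustration of the aggregation step, the following is a minimal sketch assuming a pretrained CNN as the region descriptor and a placeholder proposal function standing in for the objectness measure; the function propose_object_regions, the choice of max pooling and the backbone are illustrative assumptions, not the exact pipeline of the thesis.

    # Sketch: aggregating CNN features over object-like regions into one
    # compact image descriptor. The proposal step is a placeholder; in
    # practice an objectness / region-proposal method would supply the
    # candidate boxes (an illustrative assumption, not the thesis pipeline).
    import torch
    import torchvision.transforms.functional as TF
    from torchvision import models

    def propose_object_regions(image):
        """Hypothetical stand-in for an objectness / region-proposal step.
        Returns a list of (x, y, w, h) boxes over object-like regions."""
        h, w = image.shape[-2:]
        return [(0, 0, w // 2, h // 2), (w // 4, h // 4, w // 2, h // 2)]

    @torch.no_grad()
    def aggregate_image_representation(image, cnn):
        """Describe only object-like regions and pool them into one vector."""
        feats = []
        for (x, y, bw, bh) in propose_object_regions(image):
            crop = TF.resized_crop(image, top=y, left=x, height=bh, width=bw,
                                   size=[224, 224])
            feats.append(cnn(crop.unsqueeze(0)).flatten(1))
        feats = torch.cat(feats, dim=0)              # (num_regions, feat_dim)
        pooled = feats.max(dim=0).values             # element-wise max pooling
        return torch.nn.functional.normalize(pooled, dim=0)

    backbone = models.resnet50(weights=None)
    backbone.fc = torch.nn.Identity()                # use penultimate features
    backbone.eval()
    image = torch.rand(3, 480, 640)                  # dummy RGB image in [0, 1]
    descriptor = aggregate_image_representation(image, backbone)
    print(descriptor.shape)                          # torch.Size([2048])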
Strong supervision: In a typical supervised learning setting for object recognition, labels offer only weak supervision: all that a label provides is the presence or absence of an object in an image. It neglects a lot of useful information about the actual object, such as attributes, context, etc. Image captions, on the other hand, provide rich information about image contents. Therefore, in order to enhance the representations, we exploit image captions as strong supervision for the application of object retrieval. We show that strong supervision, when provided via pairwise constraints, helps the representations better learn the graded (non-binary) relevance between pairs of images.
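A minimal sketch of how such pairwise constraints with graded relevance could be imposed on an embedding space follows, assuming caption similarity has already been converted into relevance scores in [0, 1]; the margin scaling and the cosine distance are illustrative choices rather than the exact formulation used in the thesis.

    # Sketch: imposing graded (non-binary) relevance on an embedding space
    # via pairwise constraints. Caption similarity between two images is
    # assumed to yield a relevance score in [0, 1]; the margin scaling is
    # an illustrative choice, not the thesis formulation.
    import torch
    import torch.nn.functional as F

    def graded_pairwise_loss(anchor, img_a, img_b, rel_a, rel_b, base_margin=0.2):
        """Encourage the anchor to be closer to the more relevant image,
        with a margin that grows with the relevance gap."""
        d_a = 1.0 - F.cosine_similarity(anchor, img_a)   # distance to image a
        d_b = 1.0 - F.cosine_similarity(anchor, img_b)   # distance to image b
        margin = base_margin * (rel_a - rel_b).abs()     # graded, not fixed
        sign = torch.sign(rel_a - rel_b)                 # +1 if a is more relevant
        # hinge: violated when the more relevant image is not closer by `margin`
        return F.relu(sign * (d_a - d_b) + margin).mean()

    # toy usage with random embeddings and caption-derived relevances
    anchor = torch.randn(8, 128)
    img_a, img_b = torch.randn(8, 128), torch.randn(8, 128)
    rel_a, rel_b = torch.rand(8), torch.rand(8)
    print(graded_pairwise_loss(anchor, img_a, img_b, rel_a, rel_b).item())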
Visualization: Despite their impressive performance, CNNs offer limited transparency and are therefore treated as black boxes. Increasing depth, intricate architectures and sophisticated regularizers make them complex machine learning models. One way to make them transparent is to provide visual explanations for their predictions, i.e., visualizing the image regions that guide their predictions, thereby making them explainable. In the second part of the thesis, we develop a novel visualization method to locate the evidence in the input for a given activation at any layer in the architecture. Unlike most existing methods that rely on gradient computation, we directly exploit the dependencies across the learned representations to make CNNs more interactive. Our method enables various applications such as visualizing the evidence for a given activation (e.g. the predicted label), grounding a predicted caption, object detection, etc., in a weakly supervised setup.
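To convey the principle of tracing evidence through dependencies rather than gradients, the following toy sketch follows the strongest contributions (weight times input activation) backwards through a small stack of fully connected layers; handling convolution and pooling layers and mapping the selected units onto image pixels, as required for real CNNs, is omitted here.

    # Sketch: tracing the evidence for a chosen activation back through the
    # layers by following the strongest contributions (weight * input
    # activation) instead of computing gradients. Shown for a toy stack of
    # fully connected layers only.
    import torch
    import torch.nn as nn

    def trace_evidence(layers, x, unit, top_k=3):
        """Return indices of the input units that most support `unit`
        in the final layer's output, following max contributions backwards."""
        acts = [x]                                    # input to each layer
        for layer in layers:
            acts.append(torch.relu(layer(acts[-1])))
        active = {unit}                               # units of interest, last layer
        for layer, inp in zip(reversed(layers), reversed(acts[:-1])):
            next_active = set()
            for u in active:
                contrib = layer.weight[u] * inp       # contribution of each input unit
                next_active.update(contrib.topk(top_k).indices.tolist())
            active = next_active
        return sorted(active)                         # evidence in the input vector

    layers = [nn.Linear(16, 32), nn.Linear(32, 10)]
    x = torch.rand(16)
    print(trace_evidence(layers, x, unit=4))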
Robustness: Along with their successful adoption across various vision tasks, the learned representations are also observed to be unstable under the addition of specially crafted noise of small magnitude, called adversarial perturbations. Thus, the third and final part of the thesis focuses on the stability of the representations to these additive perturbations.
Generalizable data-free objectives: These additive perturbations cause CNNs to produce inaccurate predictions with high confidence, threatening their deployability in the real world. In order to craft these perturbations (either image-specific or image-agnostic), existing methods solve complex fooling objectives that require samples from the target data distribution. Moreover, the existing methods to craft image-agnostic perturbations are task-specific, i.e., the objectives are designed to suit the underlying task. For the first time, we introduce generalizable, data-free objectives to craft image-agnostic adversarial perturbations. Our objective generalizes across multiple vision tasks such as object recognition, semantic segmentation and depth estimation, and efficiently crafts perturbations that fool effectively. It exposes the fragility of the learned representations even in the black-box attack scenario, where no information about the target model is known. In spite of being data-free, our objectives can exploit minimal prior information about the training distribution, such as the dynamic range of the images, in order to craft stronger attacks.
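In the spirit of such a data-free objective, the sketch below optimizes a single max-norm-bounded perturbation to maximize the activations it excites across the convolutional layers of a target network, without any training samples; the backbone, layer selection, epsilon and optimizer settings are illustrative assumptions rather than the thesis's exact objective.

    # Sketch: a data-free, image-agnostic fooling objective. A single
    # perturbation (within an L-infinity budget) is optimized to maximize
    # the activations it produces at the convolutional layers of the target
    # CNN, with no samples from the training distribution.
    import torch
    from torchvision import models

    model = models.vgg16(weights=None).eval()
    for p in model.parameters():
        p.requires_grad_(False)

    # collect activations of the convolutional layers via forward hooks
    activations = []
    for module in model.features:
        if isinstance(module, torch.nn.Conv2d):
            module.register_forward_hook(lambda m, i, o: activations.append(o))

    eps = 10.0 / 255.0                               # perturbation budget
    delta = torch.zeros(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=0.1)

    for step in range(100):
        activations.clear()
        model(torch.clamp(delta, -eps, eps))         # feed the perturbation alone
        # minimizing the negative log of activation norms == maximizing activations
        loss = -sum(torch.log(a.norm() + 1e-8) for a in activations)
        opt.zero_grad()
        loss.backward()
        opt.step()

    uap = torch.clamp(delta.detach(), -eps, eps)     # image-agnostic perturbation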
Modeling the adversaries: Most existing methods present optimization approaches to craft adversarial perturbations. Moreover, for a given classifier, they generate one perturbation at a time, which is a single instance from a possibly large manifold of adversarial perturbations. In order to build robust models, it is essential to explore this manifold. We propose, for the first time, a generative approach to model the distribution of such perturbations in both "data-dependent" and "data-free" scenarios. Our generative model is inspired by Generative Adversarial Networks (GANs) and is trained using fooling and diversity objectives. The proposed generator network captures the distribution of adversarial perturbations for a given classifier and readily generates a wide variety of such perturbations. We demonstrate that perturbations crafted by our model (i) achieve state-of-the-art fooling rates, (ii) exhibit wide variety, and (iii) deliver excellent cross-model generalizability. Our work can be deemed an important step towards understanding the complex manifolds of adversarial perturbations. This knowledge of adversaries can be exploited to learn better representations that are robust to various attacks.
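The following sketch illustrates the flavor of such a generative approach in the data-dependent setting: a small generator maps latent codes to bounded perturbations and is trained with a fooling term (push the perturbed prediction away from the clean one) and a diversity term (different latent codes should yield different perturbations); the architecture, loss forms and hyper-parameters are simplified stand-ins, not the thesis's exact model.

    # Sketch: a GAN-inspired generator that maps latent codes to adversarial
    # perturbations, trained with a fooling objective and a diversity
    # objective against a frozen target classifier.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision import models

    class PerturbationGenerator(nn.Module):
        def __init__(self, z_dim=64, eps=10.0 / 255.0):
            super().__init__()
            self.eps = eps
            self.net = nn.Sequential(
                nn.Linear(z_dim, 512), nn.ReLU(),
                nn.Linear(512, 3 * 32 * 32), nn.Tanh())
        def forward(self, z):
            return self.eps * self.net(z).view(-1, 3, 32, 32)   # bounded by eps

    target = models.resnet18(weights=None).eval()                # frozen classifier
    for p in target.parameters():
        p.requires_grad_(False)

    gen = PerturbationGenerator()
    opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
    images = torch.rand(16, 3, 32, 32)                           # stand-in data batch

    for step in range(10):
        z1, z2 = torch.randn(16, 64), torch.randn(16, 64)
        d1, d2 = gen(z1), gen(z2)
        clean = F.softmax(target(images), dim=1)
        adv = F.log_softmax(target(images + d1), dim=1)
        # fooling: make the perturbed prediction diverge from the clean one
        fool_loss = -F.kl_div(adv, clean, reduction="batchmean")
        # diversity: different latent codes should yield different perturbations
        div_loss = -F.mse_loss(d1, d2)
        loss = fool_loss + 0.1 * div_loss
        opt.zero_grad()
        loss.backward()
        opt.step()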