Towards Learning Adversarially Robust Deep Learning Models
Abstract
Deep learning models have shown impressive performance across a wide spectrum of computer vision
applications, including medical diagnosis and autonomous driving. A major concern with these models is their susceptibility to adversarial samples: inputs perturbed with small, carefully crafted noise designed to manipulate the model’s prediction. A defense mechanism named Adversarial Training (AT) shows promising results against these attacks by augmenting mini-batches with adversarial samples during training. However, to scale this training to large networks and datasets, fast and simple methods for generating these adversaries, such as the single-step Fast Gradient Sign Method (FGSM), are essential. Unfortunately, single-step adversarial training (e.g., FGSM adversarial training) converges to a degenerate minimum where the model merely appears to be robust; as a result, such models remain vulnerable to simple black-box attacks. In this thesis, we explore the following aspects of adversarial training:
Failure of Single-step Adversarial Training: In the first part of the thesis, we demonstrate that the pseudo-robustness of a single-step adversarially trained model is due to limitations in the existing evaluation procedure. Further, we introduce novel variants of white-box and black-box attacks, dubbed “gray-box adversarial attacks”, based on which we propose a novel evaluation method to assess the robustness of the learned models. Finally, we propose a novel variant of adversarial training, named “Gray-box Adversarial Training”, which uses intermediate versions of the model to seed the adversaries and thereby improves the model’s robustness.
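To make the single-step setting concrete, the sketch below illustrates FGSM adversary generation and one mini-batch of single-step adversarial training. It is a minimal PyTorch-style sketch under stated assumptions: the function and variable names (`fgsm_perturb`, `single_step_at_step`, `epsilon`), the [0, 1] input range, and the equal weighting of clean and adversarial losses are illustrative choices, not details taken from the thesis.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon):
    """FGSM adversary: x_adv = x + epsilon * sign(grad_x CE(model(x), y)).
    Assumes inputs are scaled to [0, 1]."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    return (x + epsilon * grad.sign()).clamp(0.0, 1.0).detach()

def single_step_at_step(model, optimizer, x, y, epsilon):
    """One mini-batch of single-step (FGSM) adversarial training:
    the batch is augmented with FGSM adversaries of its clean samples."""
    # Gray-box Adversarial Training, as described above, would instead seed
    # the adversaries using an intermediate (earlier) version of the model.
    x_adv = fgsm_perturb(model, x, y, epsilon)
    optimizer.zero_grad()
    loss = 0.5 * (F.cross_entropy(model(x), y) +
                  F.cross_entropy(model(x_adv), y))
    loss.backward()
    optimizer.step()
    return loss.item()
```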
Regularizers for Single-step Adversarial Training: In this part of the thesis, we discuss various regularizers that help learn robust models with single-step adversarial training methods: (i) a regularizer that enforces the logits of the FGSM and I-FGSM (iterative FGSM) adversaries of a clean sample to be similar (imposed on only one adversarial pair per mini-batch); (ii) a regularizer that enforces the logits of the FGSM and R-FGSM (Random+FGSM) adversaries of a clean sample to be similar; (iii) a monotonic loss constraint that enforces the loss to increase monotonically with the perturbation size of the FGSM attack; and (iv) dropout with decaying dropout probability, which introduces a dropout layer with a decaying dropout probability after each nonlinear layer of the network.
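As an illustration of this family of regularizers, the sketch below shows a logit-pairing penalty between the FGSM and R-FGSM adversaries of a clean batch (in the spirit of regularizer (ii)), added to a single-step adversarial training loss. It is a minimal PyTorch-style sketch under stated assumptions: the function name, the MSE penalty on the logits, the step sizes `epsilon` and `alpha`, the weight `lam`, and applying the penalty to every sample in the batch (rather than to a single selected pair) are illustrative choices rather than the exact formulation used in the thesis.

```python
import torch
import torch.nn.functional as F

def pairing_regularized_loss(model, x, y, epsilon, alpha, lam):
    """Single-step adversarial training loss with a logit-pairing penalty
    between FGSM and R-FGSM adversaries of the same clean batch.
    Assumes inputs are scaled to [0, 1] and 0 < alpha < epsilon."""
    # FGSM adversary: one signed-gradient step of size epsilon from x.
    x1 = x.clone().detach().requires_grad_(True)
    g1 = torch.autograd.grad(F.cross_entropy(model(x1), y), x1)[0]
    x_fgsm = (x + epsilon * g1.sign()).clamp(0.0, 1.0).detach()

    # R-FGSM adversary: a random signed step of size alpha, followed by a
    # signed-gradient step of size (epsilon - alpha).
    x2 = (x + alpha * torch.randn_like(x).sign()).clamp(0.0, 1.0)
    x2 = x2.detach().requires_grad_(True)
    g2 = torch.autograd.grad(F.cross_entropy(model(x2), y), x2)[0]
    x_rfgsm = (x2 + (epsilon - alpha) * g2.sign()).clamp(0.0, 1.0).detach()

    # Classification loss on the FGSM adversaries, plus a penalty that pulls
    # the logits of the two adversarial versions of each sample together.
    ce = F.cross_entropy(model(x_fgsm), y)
    pairing = F.mse_loss(model(x_fgsm), model(x_rfgsm))
    return ce + lam * pairing
```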
Incorporating Domain Knowledge to Improve the Model’s Adversarial Robustness: In this final part of the thesis, we show that the existing normal training method fails to incorporate domain knowledge into the learned feature representation of the network. Further, we show that incorporating such domain knowledge into the learned feature representation results in a significant improvement in the network’s robustness against adversarial attacks, even within the normal training regime.