Towards Learning Adversarially Robust Deep Learning Models
Deep learning models have shown impressive performance across a wide spectrum of computer vision applications, including medical diagnosis and autonomous driving. A major concern with these models is their susceptibility to adversarial samples: inputs corrupted with small, carefully crafted noise designed to manipulate the model's prediction. A defense mechanism named Adversarial Training (AT), which augments mini-batches with adversarial samples during training, shows promising results against these attacks. However, to scale this training regime to large networks and datasets, fast and simple methods for generating adversaries, such as the single-step Fast Gradient Sign Method (FGSM), are essential. Unfortunately, single-step adversarial training (e.g., FGSM adversarial training) converges to a degenerate minimum at which the model merely appears to be robust; as a result, the trained models remain vulnerable to simple black-box attacks. In this thesis, we explore the following aspects of adversarial training.

Failure of Single-step Adversarial Training: In the first part of the thesis, we demonstrate that the pseudo robustness of an adversarially trained model is due to limitations in the existing evaluation procedure. Further, we introduce novel variants of white-box and black-box attacks, dubbed "gray-box adversarial attacks", and based on them propose a novel evaluation method to assess the robustness of the learned models. We also propose a novel variant of adversarial training, named "Gray-box Adversarial Training", which uses intermediate versions of the model to seed the adversaries and thereby improves the model's robustness.

Regularizers for Single-step Adversarial Training: In this part of the thesis, we discuss various regularizers that help learn robust models using single-step adversarial training methods: (i) a regularizer that enforces the logits of the FGSM and I-FGSM (iterative FGSM) versions of a clean sample to be similar, imposed on only one adversarial pair per mini-batch (a minimal sketch of this setup is given at the end of this abstract); (ii) a regularizer that enforces the logits of the FGSM and R-FGSM (Random+FGSM) versions of a clean sample to be similar; (iii) a monotonic loss constraint, which enforces the loss to increase monotonically with the perturbation size of the FGSM attack; and (iv) dropout with decaying dropout probability, which introduces a dropout layer with decaying dropout probability after each nonlinear layer of the network.

Incorporating Domain Knowledge to Improve the Model's Adversarial Robustness: In the final part of the thesis, we show that the standard (normal) training method fails to incorporate domain knowledge into the learned feature representation of the network. Further, we show that incorporating domain knowledge into the learned feature representation results in a significant improvement in the robustness of the network against adversarial attacks, even within the normal training regime.
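To make the single-step setting concrete, the following is a minimal, illustrative PyTorch sketch of FGSM adversarial training combined with regularizer (i): logit pairing between the FGSM and I-FGSM versions of one clean sample per mini-batch. It is a sketch under assumptions rather than the thesis' exact implementation; the model, optimizer, pixel range of [0, 1], perturbation budget eps, number of I-FGSM steps, loss weighting, and regularizer weight lam are all illustrative placeholders.

import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    # Single-step FGSM adversary: x + eps * sign(grad_x loss), clipped to the
    # assumed valid pixel range [0, 1].
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def ifgsm(model, x, y, eps, steps=7):
    # Iterative FGSM (I-FGSM): repeated small steps, projected back into the
    # eps-ball around the clean sample after each step.
    x_adv, alpha = x.clone().detach(), eps / steps
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1).detach()
    return x_adv

def train_step(model, optimizer, x, y, eps=8/255, lam=1.0):
    # One mini-batch of single-step (FGSM) adversarial training, augmented with
    # regularizer (i): the logits of the FGSM and I-FGSM versions of a single
    # clean sample in the mini-batch are encouraged to agree.
    model.train()
    x_fgsm = fgsm(model, x, y, eps)
    # Mini-batch augmented with adversaries: train on clean and FGSM samples
    # (the equal weighting here is an illustrative choice).
    ce_loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_fgsm), y)

    # Logit-pairing regularizer on one (FGSM, I-FGSM) pair per mini-batch.
    x1, y1 = x[:1], y[:1]
    pair_loss = F.mse_loss(model(fgsm(model, x1, y1, eps)),
                           model(ifgsm(model, x1, y1, eps)))

    total = ce_loss + lam * pair_loss
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()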