Fragile Interpretations and Interpretable Models in NLP
Abstract
Deep learning models are rarely deployed in critical domains where a wrong decision can lead to substantial financial loss, as in banking, or even to loss of life, as in medicine. We cannot rely entirely on deep learning models because they act as black boxes. Explainable AI addresses this problem by aiming to explain these black boxes. There are two broad approaches to doing so: post-hoc explainability techniques and inherently interpretable models. These two approaches form the basis of our work.
In the first part, we study the instability of post-hoc explanations, which leads to fragile interpretations. This work focuses on the robustness of NLP models as well as the robustness of their interpretations. We propose an algorithm that perturbs the input text such that the generated text is semantically, conceptually, and grammatically similar to the input, yet the interpretations produced for it are fragile. Through our experiments, we show how the interpretations of two very similar sentences can vary significantly. We show that post-hoc explanations can be unstable, inconsistent, unfaithful, and fragile, and therefore cannot be trusted. Finally, we conclude by discussing whether to trust the robust NLP models or the post-hoc explanations.
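To illustrate the idea, the sketch below searches for a meaning-preserving perturbation whose post-hoc attribution differs sharply from that of the original input. It is a minimal, hypothetical illustration, not the algorithm from this work: the callables `model`, `explain`, `similarity`, and `candidate_perturbations`, as well as the 0.9 similarity threshold, are assumed placeholders.

```python
# Minimal sketch of a fragile-interpretation search (illustrative only).
# Assumes perturbations are word-level substitutions that keep token counts
# equal, so attribution vectors of the original and perturbed text align.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def fragile_perturbation(text, model, explain, similarity, candidate_perturbations):
    """Pick the candidate that keeps the prediction and semantics
    but maximally shifts the token attributions."""
    base_pred = model(text)                       # predicted label for the original text
    base_attr = explain(model, text)              # per-token saliency scores
    best_text, best_shift = text, 0.0
    for cand in candidate_perturbations(text):    # e.g. synonym substitutions
        if model(cand) != base_pred:              # keep the model's prediction unchanged
            continue
        if similarity(text, cand) < 0.9:          # keep the semantics close (assumed threshold)
            continue
        shift = 1.0 - cosine(base_attr, explain(model, cand))
        if shift > best_shift:                    # larger shift = more fragile interpretation
            best_text, best_shift = cand, shift
    return best_text, best_shift
```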
In the second part, we design two inherently interpretable models: one for offensive language detection, framed as multi-task learning over three hierarchically related subtasks, and the other for the question pair similarity task. Our offensive language detection model achieves an F1 score of 0.78 on the OLID dataset and 0.85 on the SOLID dataset. Our question pair similarity model achieves an F1 score of 0.83. We also provide a detailed analysis of both model interpretability and prediction interpretability.