Fragile Interpretations and Interpretable Models in NLP
Abstract
Deep learning models are rarely deployed in critical domains where a wrong decision can lead to substantial financial loss, as in banking, or even to loss of life, as in medicine. We cannot rely entirely on deep learning models because they act as black boxes. Explainable AI addresses this problem by aiming to explain these black boxes. There are two broad approaches to doing so: post-hoc explainability techniques and inherently interpretable models. These two approaches form the basis of our work.
In the first part, we study the instability of post-hoc explanations, which leads to fragile interpretations. This work focuses on the robustness of NLP models as well as the robustness of their interpretations. We propose an algorithm that perturbs the input text such that the generated text is semantically, conceptually, and grammatically similar to the input, yet the interpretations produced for it are fragile. Through our experiments, we show how the interpretations of two very similar sentences can vary significantly. We show that post-hoc explanations can be unstable, inconsistent, unfaithful, and fragile, and therefore cannot be trusted. Finally, we conclude by discussing whether to trust the robust NLP models or the post-hoc explanations.
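To illustrate the idea, the sketch below searches for a meaning-preserving perturbation whose post-hoc attribution differs sharply from that of the original input. It is a minimal, hypothetical illustration, not the algorithm from this work: the callables `model`, `explain`, `similarity`, and `candidate_perturbations`, as well as the 0.9 similarity threshold, are assumed placeholders.

```python
# Minimal sketch of a fragile-interpretation search (illustrative only).
# Assumes perturbations are word-level substitutions that keep token counts
# equal, so attribution vectors of the original and perturbed text align.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def fragile_perturbation(text, model, explain, similarity, candidate_perturbations):
    """Pick the candidate that keeps the prediction and semantics
    but maximally shifts the token attributions."""
    base_pred = model(text)                       # predicted label for the original text
    base_attr = explain(model, text)              # per-token saliency scores
    best_text, best_shift = text, 0.0
    for cand in candidate_perturbations(text):    # e.g. synonym substitutions
        if model(cand) != base_pred:              # keep the model's prediction unchanged
            continue
        if similarity(text, cand) < 0.9:          # keep the semantics close (assumed threshold)
            continue
        shift = 1.0 - cosine(base_attr, explain(model, cand))
        if shift > best_shift:                    # larger shift = more fragile interpretation
            best_text, best_shift = cand, shift
    return best_text, best_shift
```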
In the second part, we design two inherently interpretable models: one for offensive language detection, framed as multi-task learning over three hierarchically related subtasks, and the other for the question pair similarity task. Our offensive language detection model achieves an F1 score of 0.78 on the OLID dataset and 0.85 on the SOLID dataset. Our question pair similarity model achieves an F1 score of 0.83. We also provide a detailed analysis of both model interpretability and prediction interpretability.