Deep Learning for Bug Localization and Program Repair
In this thesis, we focus on the problem of program debugging and present novel deep learning based techniques for bug-localization and program repair. Deep learning techniques have been successfully applied to a variety of tasks in natural language processing over the years. Although natural language text and programs are similar to some extent, the latter have procedural interpretation and richer structure. Applying deep learning techniques to programs presents many novel challenges, which arise due to these differences. We address some of these challenges in this thesis. Most of the existing program debugging research is dominated by formal, theory-first approaches. These approaches fail to take advantage of the existing codebases available online in the form of open source software repositories and student assignment submissions to massive open online courses on programming. Recognizing this, researchers have begun to replace expert-designed heuristics with models learned from codebases to improve the performance of the conventional debugging techniques. This thesis shows that it is possible to solve program debugging problems directly from raw programs using deep learning techniques in an end-to-end manner. More specifically, we present three approaches for bug-localization and program repair that are entirely data-driven and learn to perform their task instead of following the steps specified by a domain expert. We first introduce the notion of common programming errors and present a deep neural network based end-to-end technique, called DeepFix, that can fix multiple such errors in a program without relying on any external tool to locate or fix them. At the heart of DeepFix is a multi-layered sequence-to-sequence neural network equipped with an attention mechanism, comprising an encoder recurrent neural network (RNN) to process the input and a decoder RNN that generates the output by attending to the encoded input. The network is trained on a labeled dataset to predict a faulty program location along with the corrected program statement. Multiple errors in a program can be fixed by invoking DeepFix iteratively. Our experiments demonstrate that DeepFix is effective and fixes thousands of programs. While repositories containing erroneous programs are easily available, the labels of these programs (faulty program location and correct statement) required by DeepFix are not easily available. Labeling a large number of erroneous programs is a daunting task. To address this issue, we propose a novel deep reinforcement learning based technique, called RLAssist, that does not require labeled training data and still matches the performance of DeepFix. At the core of RLAssist is a novel programming language correction framework amenable to reinforcement learning. The framework allows an agent to mimic human actions for text navigation and editing. We demonstrate that the agent can be trained through self-exploration directly from the raw input, that is, the program text itself, without any prior knowledge of the formal syntax of the programming language. Reinforcement learning techniques are, however, usually slow to train. We also show that RLAssist can be trained much faster by leveraging expert demonstrations for as little as one-tenth of its training data, which also helps it in achieving better performance than DeepFix. Finally, we present a deep learning based technique for semantic bug-localization in programs with respect to failing tests. The proposed technique works in two phases. In the first phase, a novel tree convolutional neural network is used to predict whether a program passes or fails a given test. In the second phase, we query a state-of-the-art neural prediction attribution technique to find out which lines of the programs make the network predict the failures to localize the bugs. This is a static bug-localization technique which requires neither program instrumentation nor multiple executions, as is necessary for existing dynamic bug-localization techniques. Our experiments show that the proposed technique is competitive with two state-of-the-art program spectrum based and one syntactic difference based bug-localization baselines. All the techniques proposed in this thesis are programming language agnostic. We believe that the ideas and tools developed in this work can potentially be a road-map for future attempts at applying deep learning techniques to more problems in software engineering research.