Concurrency Analysis and Mining Techniques for APIs
MetadataShow full item record
Software components expose Application Programming Interfaces (APIs) as a means to access their functionality, and facilitate reuse. Developers use APIs supplied by programming languages to access the core data structures and algorithms that are part of the language framework. They use the APIs of third-party libraries for specialized tasks. Thus, APIs play a central role in mediating a developer's interaction with software, and the interaction between different software components. However, APIs are often large, complex and hard to navigate. They may have hundreds of classes and methods, with incomplete or obsolete documentation. They may encapsulate concurrency behaviour that the developer is unaware of. Finding the right functionality in a large API, using APIs correctly, and maintaining software that uses a constantly evolving API are challenges that every developer runs into. In this thesis, we design automated techniques to address two problems pertaining to APIs (1) Concurrency analysis of APIs, and (2) API mining. Speci cally, we consider the concurrency analysis of asynchronous APIs, and mining of math APIs to infer the functional behaviour of API methods. The problem of concurrency bugs such as race conditions and deadlocks has been well studied for multi-threaded programs. However, developers have been eschewing a pure multi-threaded programming model in favour of asynchronous programming models supported by asynchronous APIs. Asynchronous programs and multi-threaded programs have different semantics, due to which existing techniques to analyze the latter cannot detect bugs present in programs that use asynchronous APIs. This thesis addresses the problem of concurrency analysis of programs that use asynchronous APIs in an end-to-end fashion. We give operational semantics for important classes of asynchronous and event-driven systems. The semantics are designed by carefully studying real software and serve to clarify subtleties in scheduling. We use the semantics to inform the design of novel algorithms to find races and deadlocks. We implement the algorithms in tools, and show their effectiveness by finding serious bugs in popular open-source software. To begin with, we consider APIs for asynchronous event-driven systems supporting pro-grammatic event loops. Here, event handlers can spin event loops programmatically in addition to the runtime's default event loop. This concurrency idiom is supported by important classes of APIs including GUI, web browser, and OS APIs. Programs that use these APIs are prone to interference between a handler that is spinning an event loop and another handler that runs inside the loop. We present the first happens-before based race detection technique for such programs. Next, we consider the asynchronous programming model of modern languages like C]. In spite of providing primitives for the disciplined use of asynchrony, C] programs can deadlock because of incorrect use of blocking APIs along with non-blocking (asynchronous) APIs. We present the rst deadlock detection technique for asynchronous C] programs. We formulate necessary conditions for deadlock using a novel program representation that represents procedures and continuations, control ow between them and the threads on which they may be scheduled. We design a static analysis to construct the pro-gram representation and use it to identify deadlocks. Our ideas have resulted in research tools with practical impact. Sparse Racer, our tool to detect races, found 13 previously unknown use-after-free bugs in KDE Linux applications. Dead Wait, our deadlock detector, found 43 previously unknown deadlocks in asynchronous C] libraries. Developers have fixed 43 of these races and deadlocks, indicating that our techniques are useful in practice to detect bugs that developers consider worth fixing. Using large APIs effectively entails finding the right functionality and calling the methods that implement it correctly, possibly composing many API elements. Automatically infer-ring the information required to do this is a challenge that has attracted the attention of the research community. In response, the community has introduced many techniques to mine APIs and produce information ranging from usage examples and patterns, to protocols governing the API method calling sequences. We show how to mine unit tests to match API methods to their functional behaviour, for the specific but important class of math APIs. Math APIs are at the heart of many application domains ranging from machine learning to scientific computations, and are supplied by many competing libraries. In contrast to obtaining usage examples or identifying correct call sequences, the challenge in this domain is to infer API methods required to perform a particular mathematical computation, and to compose them correctly. We let developers specify mathematical computations naturally, as a math expression in the notation of interpreted languages (such as Matlab). Our unit test mining technique maps subexpressions to math API methods such that the method's functional behaviour matches the subexpression's executable semantics, as defined by the interpreter. We apply our technique, called MathFinder, to math API discovery and migration, and validate it in a user study. Developers who used MathFinder nished their programming tasks twice as fast as their counterparts who used the usual techniques like web and code search, and IDE code completion. We also demonstrate the use of MathFinder to assist in the migration of Weka, a popular machine learning library, to a different linear algebra library.