Performance Characterization and Optimizations of Traditional ML Applications
Abstract
Even in the era of deep learning based methods, traditional machine learning (ML) methods operating on large data sets continue to attract significant attention. However, there is an apparent lack of detailed performance characterization of these methods in the context of large training datasets. In this thesis, we study the systems behaviour of a number of traditional ML methods, as implemented in popular free software libraries/modules, to identify the critical performance bottlenecks experienced by these applications. The performance characterization study reveals several interesting insights into the performance of these applications. We observe that the processor backend is the major bottleneck for our workloads, with poor cache performance coupled with a high fraction of CPU stall cycles caused by memory latency. We also observe very poor utilization of the execution ports, with at most one micro-op being executed for around 45% of the execution time. For the tree-based workloads, CPU stalls due to bad speculation are also significant, accounting for as much as 25% of CPU cycles.

We then evaluate the performance benefits of applying some well-known optimizations at the level of the caches and the main memory. More specifically, we test the usefulness of (i) software prefetching to improve cache performance and (ii) data layout and computation reordering to improve locality in DRAM accesses. These optimizations are implemented as modifications to the well-known scikit-learn library, so that application programmers can leverage them easily. We evaluate the impact of the proposed optimizations using a combination of simulation and execution on a real system. The software prefetching optimization was implemented for ten workloads and yielded performance benefits of 5.2% to 27% on seven of the ten ML applications, while the data layout and computation reordering methods yielded performance improvements of around 8% to 23% on seven out of eight neighbour-based and tree-based ML applications.
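To make the prefetching direction concrete, the sketch below illustrates the general idea of software prefetching on an indirect, gather-style access pattern of the kind that dominates neighbour- and tree-based kernels. It is only an illustrative example under assumed names (sum_gathered, PF_DIST) and is not the implementation used in this thesis or in scikit-learn; it relies on the GCC/Clang __builtin_prefetch intrinsic.

    #include <stddef.h>

    #define PF_DIST 8   /* hypothetical prefetch distance, tuned empirically */

    /* Illustrative sketch only: prefetch the element that will be gathered
     * PF_DIST iterations ahead, so its cache miss overlaps with useful
     * work on the current element. */
    double sum_gathered(const double *values, const size_t *idx, size_t n)
    {
        double acc = 0.0;
        for (size_t i = 0; i < n; ++i) {
            if (i + PF_DIST < n)
                /* GCC/Clang builtin: read access (0), moderate temporal locality (2) */
                __builtin_prefetch(&values[idx[i + PF_DIST]], 0, 2);
            acc += values[idx[i]];
        }
        return acc;
    }

The same structure carries over to the scikit-learn kernels modified in this work: the prefetch is issued a fixed distance ahead of an index-driven load whose target the hardware prefetcher cannot predict.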