Loop Transformations for Multi-/Many-Core Architectures using Machine Learning
Abstract
Loop transformation techniques such as loop tiling, loop interchange, and unroll-and-jam help
expose coarse-grain and fine-grain data-level parallelism and exploit data locality.
These transformations are important for extracting good performance on modern multi-core and
many-core architectures. When applying these loop transformations, choices such as the tile
size to use, the loop order that gives the best performance, and whether or not
unroll-and-jam is beneficial are critical to the performance achieved. Making these choices,
however, is complex and requires a good cost model for execution cycles. Constructing
a cost model that faithfully represents the target architecture is hard due to
the interplay among multi-level cache hierarchies, the different levels of parallelism exploited by
the architecture, hardware prefetch configurations, and the optimization phases of the compiler.
Existing polyhedral loop transformation techniques identify and expose different types of
parallelism and data locality. They transform a given loop into a tiled loop with a legal loop
order that satisfies all the data dependences in the original loop, and they provide options to
apply optimizations such as loop unroll-and-jam. However, they often fail to pick the
best-performing tile size and loop order, to identify when the unroll-and-jam transformation
should be applied, and to determine which loop order is best.
In this thesis, we address the problem of generating efficient code for multi-dimensional loop
nests, focusing specifically on 2-dimensional loops and targeting multi-/many-core architectures,
when loop transformations such as tiling, interchange, and unroll-and-jam are considered. We
propose using supervised machine learning techniques to address the above problems; more
specifically, we formulate them as classification problems. Supervised methods, however,
need a large number of representative real-world loops to use as training data. To overcome
this scarcity of training data, we develop a synthetic loop generator tool whose generated
loops are modelled on real-world loops. The machine learning models developed in this thesis
are trained on synthetic loops generated by our tool.
We address the above problems in three parts. First, for a given tile size, we build Support
Vector Machine (SVM) based classifiers to identify the best-performing loop
order for two different target architectures: i) a multi-core Intel Xeon Cascadelake and ii) a
many-core Intel Xeon Phi Knights Landing (KNL), and test them on loop nests from the Polybench
test suite. We explore two different data prefetch configurations of the KNL architecture. The
SVM-based classifiers are trained on the data set generated by our tool to identify
the optimal loop order. We then study the use of an ensemble model to make the prediction
of the optimal loop order robust. Experimental evaluation demonstrates that our ensemble model
identifies near-optimal loop orders that are within 8.80% and 6.54% of the optimal loop order
for the two test data sets studied on the Intel Xeon Cascadelake architecture. On the Intel
Xeon Phi Knights Landing, our ensemble model identifies near-optimal loop orders that are
within 3.43% and 7.20% of the optimal loop order for the two prefetch configurations studied.
Our approach outperforms the state-of-the-art open-source compiler frameworks Polly and Pluto
with geometric mean speedups of 1.08x to 1.87x.
Next, to identify the best-performing <tile size, loop order> combination, we develop a
carefully tuned hierarchical classifier that makes use of domain knowledge, where each level
of the hierarchy uses an SVM classifier. The hierarchical classifier is trained on the
synthetic loops generated by our tool and tested on the same set of loops from the Polybench
test suite, for the same target architectures stated above. Performance evaluation
demonstrates that the tile size and loop order predicted by our SVM-based classifier result
in transformed loops whose performance is within 18% and 9% of optimal for the two test data
sets on the Intel Xeon Cascadelake system, and within 18% and 13% of optimal for the two
prefetch configurations on the Intel Xeon Phi (KNL) system. Further, our method outperforms
Polly and Pluto with geometric mean speedups of 1.33x to 2.42x.
Last, to determine whether the unroll-and-jam optimization benefits a given loop, and which
loop order then performs best, we propose a heuristic approach. Our simple yet practical
heuristic uses the characteristics of a given loop, represented as high-level features, to
decide whether unroll-and-jam is beneficial and to pick the best-performing loop order. We
supplement the heuristic approach with a simple SVM classifier model and show that both
approaches perform equally well. We evaluate our heuristic on loops from the Polybench test
suite and on a set of loops generated using our synthetic loop generator tool, on the Intel
Xeon Cascadelake system. Experimental evaluation demonstrates that our approach yields
performance within 6.51% and 12.97% of the best performance for the Polybench and synthetic
loop benchmark suites, respectively, and outperforms Polly and Pluto with geometric mean
speedups of 1.15x to 1.79x.