Loop Transformations for Multi-/Many-Core Architectures using Machine Learning

Babalad, Shilpa

dc.contributor.advisor	Govindarajan, R
dc.contributor.advisor	Jacob, Matthew T
dc.contributor.author	Babalad, Shilpa
dc.date.accessioned	2025-02-12T09:01:02Z
dc.date.available	2025-02-12T09:01:02Z
dc.date.submitted	2025
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/6807
dc.description.abstract	Loop transformation techniques such as loop tiling, loop interchange and unroll-and-jam help expose better coarse-grain and fine-grain data-level parallelisms as well as exploit data locality. These transformations are important to extract better performance on modern multi-core and many-core architectures. In applying these loop transformation techniques, identifying the tile size to be used, or the loop order which gives better performance, or determining whether or not applying unroll-and-jam is beneficial are critical for achieving better performance. This, however, is complex and requires building a good cost model for execution cycles. Constructing such a cost model that represents the target architecture under consideration is hard due to the interplay among multi-level cache hierarchies, different levels of parallelisms exploited by the architecture, hardware prefetch configurations, and optimization phases of a compiler. Existing polyhedral loop transformation techniques identify and expose different types of parallelism and data locality. They transform a given loop into a tiled loop with a legal loop order that satisfies all the data dependencies in the original loop. They provide options to apply optimizations like loop unroll-and-jam. However, they often fail to pick the best-performing tile size and loop order and to identify when to apply the unroll-and-jam transformation and what is the best loop order. In this thesis, we address the problem of generating efficient code for multi-dimensional loop nests, specifically focusing on 2-dimensional loops, targeting multi-/many-core architectures, when loop transformations such as tiling, interchange, and unroll-and-jam are considered. We propose using supervised machine learning techniques to address the above problems. More specifically, we formulate the problems as classification problems. However, supervised methods need a large amount of representative real-world loops that can be used as training data sets. To overcome the problem of limited training data set, we propose to develop a synthetic loop generator tool. The synthetic loops generated by our tool are modelled based on real-world loops. The machine learning models developed in this thesis, are trained using synthetic loops generated by our tool. We propose to address the above stated problems in three parts. First, for a given tile size, we build Support Vector Machine (SVM) based classifiers to identify the best-performing loop order for two different target architectures: i) a multi-core Intel Xeon Cascadelake and ii) a many-core Intel Xeon Phi Knights Landing (KNL) and test them on loop nests from Polybench test suite. We explore two different data prefetch configurations of the KNL architecture. The SVM based classifiers are trained using the training data set generated by our tool, to identify the optimal loop order. We then study the use of an ensemble model to make the prediction of optimal loop order robust. Experimental evaluation demonstrates that, our ensemble model identifies near-optimal loop orders that are within 8.80% and 6.54% of the optimal loop order for the two test data sets studied on the Intel Xeon Cascadelake architecture. On the Intel Xeon Phi Knights Landing, our ensemble model identifies near-optimal loop orders that are within 3.43% and 7.20% of the optimal loop order for the two prefetch configurations studied. Our approach outperforms state-of-the-art open-source compiler frameworks Polly and Pluto with a geometric mean speedup of 1.08x to 1.87x. Next, to identify the best-performing <tile size, loop order> combination, we propose to develop an hierarchical classifier. We develop a carefully tuned hierarchical classifier, making use of domain knowledge, where each level in the hierarchical classifier uses an SVM classifier. The hierarchical classifier is trained using the synthetic loops generated by our tool and tested on the same set of loops from Polybench test suite, for the same target architectures stated above. Performance evaluation demonstrates that, the tile size and loop order predicted by our SVM-based classifier results in transformed loops whose performance is within 18% and 9% of the optimal performance for two different test data sets on Intel Xeon Cascadelake system and 18% and 13% of the optimal performance for two different prefetch configurations on Intel Xeon Phi (KNL) system. Further, our method outperforms Polly and Pluto with a geometric mean speedup of 1.33x to 2.42x. Last, to determine whether the unroll-and-jam optimization benefits a given loop and which loop order performs the best, we propose to use an heuristic approach. Our simple yet practical heuristic uses the characteristics of a given loop, represented as high-level features, to decide whether unroll-and-jam is beneficial and the best-performing loop order. We supplement the heuristic approach with a simple SVM classifier model and show both approaches perform equally well. We evaluate our heuristic on loops from the Polybench test suite and a set of loops generated using our synthetic loop generator tool on the Intel Xeon Cascadelake system. Experimental evaluation demonstrates that, our approach results in performance that is within 6.51% and 12.97% of the best performance for Polybench and synthetic loops benchmark suites, respectively and outperforms Polly and Pluto with a geometric mean speedup of 1.15x to 1.79x.	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	;ET00817
dc.rights	I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation	en_US
dc.subject	Compiler Transformations	en_US
dc.subject	Supervised Machine Learning	en_US
dc.subject	Multi/Many-Core Architectures	en_US
dc.subject	Loop transformations	en_US
dc.subject	Vectorization	en_US
dc.subject	Parallelization	en_US
dc.subject	Supervised learning	en_US
dc.subject	Support Vector Machine	en_US
dc.subject	Compiler Heuristics	en_US
dc.subject	Loop Tiling	en_US
dc.subject	Loop Permutation	en_US
dc.subject	Unroll-and-Jam	en_US
dc.subject	Knights Landing	en_US
dc.subject.classification	Research Subject Categories::TECHNOLOGY::Information technology::Computer science	en_US
dc.title	Loop Transformations for Multi-/Many-Core Architectures using Machine Learning	en_US
dc.type	Thesis	en_US
dc.degree.name	PhD	en_US
dc.degree.level	Doctoral	en_US
dc.degree.grantor	Indian Institute of Science	en_US
dc.degree.discipline	Engineering	en_US

Files in this item

Name:: Loop_Transformations_For_Multi ...
Size:: 3.785Mb
Format:: PDF
Description:: Thesis full text

View/Open

This item appears in the following Collection(s)

Computer Science and Automation (CSA) [394]

Show simple item record