Loop Transformations for Multi-/Many-Core Architectures using Machine Learning
Abstract
Loop transformation techniques such as loop tiling, loop interchange, and unroll-and-jam help
expose coarse-grain and fine-grain data-level parallelism and exploit data locality.
These transformations are important for extracting good performance on modern multi-core and
many-core architectures. When applying these loop transformations, choices such as the tile
size to use, the loop order that gives the best performance, and whether or not
unroll-and-jam is beneficial are critical to the performance achieved. Making these choices,
however, is complex and requires a good cost model for execution cycles. Constructing
a cost model that faithfully represents the target architecture is hard due to
the interplay among multi-level cache hierarchies, the different levels of parallelism exploited by
the architecture, hardware prefetch configurations, and the optimization phases of the compiler.
Existing polyhedral loop transformation techniques identify and expose different types of
parallelism and data locality. They transform a given loop into a tiled loop with a legal loop
order that satisfies all the data dependences in the original loop, and they provide options to
apply optimizations such as loop unroll-and-jam. However, they often fail to pick the
best-performing tile size and loop order, to identify when the unroll-and-jam transformation
should be applied, and to determine which loop order is best.
In this thesis, we address the problem of generating efficient code for multi-dimensional loop
nests, focusing specifically on 2-dimensional loops and targeting multi-/many-core architectures,
when loop transformations such as tiling, interchange, and unroll-and-jam are considered. We
propose using supervised machine learning techniques to address the above problems; more
specifically, we formulate them as classification problems. Supervised methods, however,
need a large number of representative real-world loops to use as training data. To overcome
this scarcity of training data, we develop a synthetic loop generator tool whose generated
loops are modelled on real-world loops. The machine learning models developed in this thesis
are trained on synthetic loops generated by our tool.
We address the above problems in three parts. First, for a given tile size, we build Support
Vector Machine (SVM) based classifiers to identify the best-performing loop
order for two different target architectures: i) a multi-core Intel Xeon Cascadelake and ii) a
many-core Intel Xeon Phi Knights Landing (KNL), and test them on loop nests from the Polybench
test suite. We explore two different data prefetch configurations of the KNL architecture. The
SVM-based classifiers are trained on the data set generated by our tool to identify
the optimal loop order. We then study the use of an ensemble model to make the prediction
of the optimal loop order robust. Experimental evaluation demonstrates that our ensemble model
identifies near-optimal loop orders that are within 8.80% and 6.54% of the optimal loop order
for the two test data sets studied on the Intel Xeon Cascadelake architecture. On the Intel
Xeon Phi Knights Landing, our ensemble model identifies near-optimal loop orders that are
within 3.43% and 7.20% of the optimal loop order for the two prefetch configurations studied.
Our approach outperforms the state-of-the-art open-source compiler frameworks Polly and Pluto
with geometric mean speedups of 1.08x to 1.87x.
Next, to identify the best-performing <tile size, loop order> combination, we develop a
carefully tuned hierarchical classifier that makes use of domain knowledge, where each level
of the hierarchy uses an SVM classifier. The hierarchical classifier is trained on the
synthetic loops generated by our tool and tested on the same set of loops from the Polybench
test suite, for the same target architectures stated above. Performance evaluation
demonstrates that the tile size and loop order predicted by our SVM-based classifier result
in transformed loops whose performance is within 18% and 9% of optimal for the two test data
sets on the Intel Xeon Cascadelake system, and within 18% and 13% of optimal for the two
prefetch configurations on the Intel Xeon Phi (KNL) system. Further, our method outperforms
Polly and Pluto with geometric mean speedups of 1.33x to 2.42x.
Last, to determine whether the unroll-and-jam optimization benefits a given loop, and which
loop order then performs best, we propose a heuristic approach. Our simple yet practical
heuristic uses the characteristics of a given loop, represented as high-level features, to
decide whether unroll-and-jam is beneficial and to pick the best-performing loop order. We
supplement the heuristic approach with a simple SVM classifier model and show that both
approaches perform equally well. We evaluate our heuristic on loops from the Polybench test
suite and on a set of loops generated using our synthetic loop generator tool, on the Intel
Xeon Cascadelake system. Experimental evaluation demonstrates that our approach yields
performance within 6.51% and 12.97% of the best performance for the Polybench and synthetic
loop benchmark suites, respectively, and outperforms Polly and Pluto with geometric mean
speedups of 1.15x to 1.79x.