Optimizing Dense Matrix Computations with PolyMage
Abstract
Linear algebra computations and other computations with arbitrary affine accesses are ubiquitous in applications
from domains such as scientific computing, digital signal processing (DSP), and deep
neural networks. Libraries such as OpenBLAS, MKL, and FFTW provide efficient hand-optimized
implementations of the matrix and vector primitives used in these domains for
various architectures, and applications are built upon these standard library routines
to obtain high performance. However, these libraries do not perform well for all matrix sizes;
in particular, they achieve sub-optimal performance for small matrices. Their interfaces can
also be fairly complex, requiring several input parameters, so an even higher level of
abstraction is often desired to improve productivity. Further, using these libraries forfeits the
opportunity to optimize across different library calls, and exploiting such cross-call
locality through traditional library-based programming is complex, as the following sketch illustrates.
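To make the cross-call locality point concrete, consider this minimal NumPy sketch; the computation, ReLU(Wx + b), is our illustrative choice rather than an example from the thesis. A chain of library calls materializes an intermediate array that must round-trip through memory, whereas an optimizer that sees across the calls can generate a fused loop nest:

```python
import numpy as np

# Library-call style: np.dot materializes the full intermediate vector,
# which is written to memory and re-read by the element-wise passes.
def chained_library_calls(W, x, b):
    t = np.dot(W, x) + b       # intermediate t leaves the cache for large sizes
    return np.maximum(t, 0.0)  # second pass re-reads t from memory

# Fused formulation: each intermediate value stays in a register and the
# data is traversed once. No sequence of opaque library calls can be
# combined into code of this shape.
def fused(W, x, b):
    out = np.empty_like(b)
    for i in range(W.shape[0]):
        acc = b[i]
        for j in range(W.shape[1]):
            acc += W[i, j] * x[j]
        out[i] = max(acc, 0.0)  # no intermediate array is materialized
    return out
```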
The work in this thesis proposes (i) a tile size selection model that works for
arbitrary affine accesses and eschews auto-tuning, (ii) a simple heuristic that determines
when mapping a computation to a library call is profitable, falling back to generated code
otherwise (sketched below), (iii) an intra-tile optimization technique that exposes inner-loop
parallelism, enabling a general-purpose compiler's vectorizer to generate vector instructions,
and (iv) a DSL approach with high-level primitives and functions that allow computations to be
expressed efficiently.
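As a hypothetical sketch of contribution (ii), a size-based profitability test might decide whether a matched matrix-multiply idiom is mapped to an optimized library routine or left to generated code. The threshold value and all names below are illustrative assumptions, not the thesis's actual model:

```python
# Assumed cutoff: below this, library-call overheads are taken to dominate.
SMALL_DIM_CUTOFF = 64

def map_to_library(m: int, n: int, k: int) -> bool:
    """For an (m x k) * (k x n) product: True -> emit a library call,
    False -> fall back to compiler-generated (tiled, vectorized) code."""
    return min(m, n, k) >= SMALL_DIM_CUTOFF
```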
The proposed optimizations are implemented in PolyMage, a domain-specific
language (DSL) originally designed for image processing pipelines, whose optimizer
performs transformations such as fusion, tiling, and other loop optimizations. First,
PolyMage's language specification is extended to support additional functions and
primitives for matrix computations and to perform domain inference automatically; a
computation such as a matrix product can then be expressed as in the sketch below. Second,
the PolyMage compiler's fusion and tile size selection heuristics are extended to work
with arbitrary affine accesses, enabling it to optimize computations from additional
domains, including dense linear algebra and certain DSP computations. The heuristic
that maps matrix operations to corresponding optimized library implementations
is incorporated with the help of an idiom recognition algorithm.
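To give a flavor of the extended front end, here is a rough sketch of how a dense matrix product might be expressed. The constructor names mimic PolyMage's published image-processing examples; the Matrix primitive and the exact reduction form are assumptions for illustration, not the thesis's verbatim API:

```python
# Constructor names (Parameter, Variable, Interval, Reduction, Reduce,
# Op.Sum) follow PolyMage's published image-processing examples; Matrix
# and this exact shape of the extended front end are assumptions.
from constructs import *  # assumed module path; depends on the PolyMage checkout

N = Parameter(Int, "N")
i = Variable(Int, "i")
j = Variable(Int, "j")
k = Variable(Int, "k")
dim = Interval(Int, 0, N - 1)

A = Matrix(Double, "A", [N, N])  # assumed matrix primitive from the extension
B = Matrix(Double, "B", [N, N])

# C(i, j) = sum over k of A(i, k) * B(k, j), written as a reduction;
# the output and reduction domains are declared via Intervals.
C = Reduction(([i, j], [dim, dim]), ([i, j, k], [dim, dim, dim]), Double, "C")
C.defn = [Reduce(C(i, j), A(i, k) * B(k, j), Op.Sum)]
```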
Finally, the thesis experimentally evaluates these optimizations on modern multicore
systems using representative benchmarks from PolyBench, digital signal processing, and
image processing, and compares the results to state-of-the-art optimization approaches
and frameworks in each domain. Experiments on DSP benchmarks show that our optimizations
give a mean speedup of 7.7x over the existing PolyMage optimizer, 5.1x over
the Python NumPy package, and 1.9x over MATLAB toolboxes with parallel computing support.
Linear algebra computations from the PolyBench benchmark suite obtain an average
speedup of 21.7x over the existing PolyMage optimizer, 3.6x over Pluto, and 4.1x over
PPCG.