Optimizing Dense Matrix Computations with PolyMage
Kumudha, K N
Linear algebra computations and other arbitrary affine accesses are ubiquitous in applications from domains such as scientific computing, digital signal processing (DSP), and deep neural networks. Libraries such as OpenBLAS, MKL, and FFTW provide efficient hand-optimized implementations of the matrix and vector primitives used in these domains for various architectures, and applications are then built upon these standard library routines to obtain high performance. However, these libraries do not perform well for all matrix sizes and obtain sub-optimal performance for small matrices. Their interfaces can also be fairly complex, requiring several input parameters, so an even higher level of abstraction is often desired to improve productivity. Further, composing library calls forfeits the opportunity to optimize across them, and exploiting this cross-call locality by hand with library functions is complex. The work in this thesis proposes (i) a tile size selection model that works for any arbitrary affine access and eschews auto-tuning, (ii) a simple heuristic to determine the profitability of mapping computations to library calls, falling back to generated code otherwise, (iii) an intra-tile optimization technique to expose inner-loop parallelism, thus enabling a general-purpose compiler's vectorizer to generate vector instructions, and (iv) a DSL approach with high-level primitives and functions to express computations concisely. The proposed optimizations are implemented in the PolyMage DSL. PolyMage is a domain-specific language (DSL) originally designed for image processing pipelines; its optimizer performs transformations such as fusion, tiling, and loop optimizations for such pipelines. Firstly, PolyMage's language specification is extended to support additional functions and primitives for matrix computations and to perform domain inference automatically.
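To illustrate the cross-call locality that the abstract refers to, the sketch below (plain numpy, not the PolyMage API) compares a computation built from separate library-style calls against a fused, tiled version. In the fused version each tile of the matrix product is consumed immediately by the scale-and-add, so the full intermediate product is never materialized. The function name and tile size are illustrative, not from the thesis.

```python
import numpy as np

def fused_scale_add(A, B, C, s, tile=2):
    """Compute (A @ B) * s + C tile by tile, fusing the scale-and-add
    into the consumer of each product tile so the full A @ B
    intermediate is never materialized (illustrative stand-in for
    the kind of cross-call fusion described in the thesis)."""
    n = A.shape[0]
    m = B.shape[1]
    D = np.empty((n, m), dtype=float)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            # One tile of the product, consumed immediately.
            t = A[i:i + tile, :] @ B[:, j:j + tile]
            D[i:i + tile, j:j + tile] = t * s + C[i:i + tile, j:j + tile]
    return D
```

With separate calls, `A @ B` would be written to memory in full and then re-read by the scale-and-add; the fused loop keeps each tile hot in cache between the two operations.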
Secondly, the PolyMage compiler's fusion and tile size selection heuristics are extended to work with arbitrary affine accesses. It is thus able to optimize computations from additional domains, including dense linear algebra and certain DSP computations. The heuristic that maps matrix operations to the corresponding optimized library implementations is incorporated with the help of an idiom recognition algorithm. Finally, the thesis experimentally evaluates these optimizations on modern multicore systems using representative benchmarks from PolyBench, digital signal processing, and image processing, comparing the results to state-of-the-art optimization approaches and frameworks in each domain. Experiments on DSP benchmarks show that our optimizations give a mean speedup of 7.7× over the existing PolyMage optimizer, 5.1× over the Python numpy package, and 1.9× over MATLAB toolboxes with parallel computing support. Linear algebra computations from the PolyBench benchmark suite obtain an average speedup of 21.7× over the existing PolyMage optimizer, 3.6× over Pluto, and 4.1× over PPCG.
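The idiom-recognition-plus-profitability idea above can be sketched as follows. This is a hypothetical dispatcher, not the thesis's implementation: the size threshold, the helper names, and the use of numpy's BLAS-backed `@` as the "library path" are all stand-ins for the real profitability model and generated code.

```python
import numpy as np

# Assumed cutoff for illustration only; the thesis's profitability
# heuristic is richer than a single size threshold.
LIBRARY_THRESHOLD = 64

def generated_matmul(A, B):
    """Naive triple loop standing in for compiler-generated code."""
    n, k = A.shape
    m = B.shape[1]
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

def matmul(A, B):
    """Once a matmul idiom is recognized, map it to an optimized
    library call when the problem is large enough to amortize the
    call overhead; otherwise fall back to generated code."""
    if min(A.shape[0], A.shape[1], B.shape[1]) >= LIBRARY_THRESHOLD:
        return A @ B  # library path (BLAS-backed matmul in numpy)
    return generated_matmul(A, B)
```

The design point this mirrors is that library calls carry fixed overheads and packing costs that only pay off beyond some problem size, which is why small matrices are routed to generated code instead.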