Optimizing Matrix Multiplication for the REDEFINE Many-Core Co-processor

Kulkarni, Pratik

Abstract

Matrix-matrix multiplication is an important operation in many applications, so it must be parallelized optimally for the architecture on which those applications run. REDEFINE is a many-core co-processor, and a high-performance implementation of matrix-matrix multiplication on it must use the different memory layers so that the cost of moving data between them is amortized over useful computation. This thesis presents an approach aimed at this goal that also distributes the load across the nodes of REDEFINE to achieve good load balancing. The approach first uses the elemental distribution of matrices, which, apart from achieving good load balancing, decouples the storage blocking size from the algorithmic blocking size by setting the storage blocking size to 1. The BLIS framework, which breaks the inner kernel of the GotoBLAS GEMM implementation into two additional loops around a micro-kernel, is then employed on every node. A modified eSUMMA2D-C algorithm makes it possible to use the elemental distribution of matrices with the BLIS framework. The BLIS framework not only allows the memory layers of REDEFINE to be exploited to amortize the cost of moving data between them, but also exposes opportunities for parallelism within the inner kernel of the GotoBLAS implementation. The fork-join model is used for both parallelization and synchronization. The values of the blocking parameters are chosen so that they enable optimal amortization of data movement between memory layers over computation and also yield a high degree of parallelism from the loops being parallelized. Thus, the approach integrates the elemental distribution of matrices with the BLIS framework to achieve good load balancing between the nodes of REDEFINE, optimal amortization of the movement of matrix elements between the different memory layers over computation, and a high degree of parallelization.
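
The two building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the block sizes `NC`, `KC`, `MC`, `NR`, `MR`, the function names, and the owner rule `(i mod r, j mod c)` on an `r x c` node grid are assumptions made here for illustration. The sketch also omits the packing of blocks into contiguous buffers, the eSUMMA2D-C communication steps, and the fork-join parallelization. It shows only the loop structure of a BLIS-style GEMM (three outer blocking loops, plus the two additional loops around a micro-kernel) and how a storage blocking size of 1 spreads elements cyclically over nodes.

```python
# Hypothetical block sizes, chosen only for illustration; they are NOT
# the values tuned for REDEFINE's memory layers in the thesis.
NC, KC, MC = 8, 4, 6   # blocking for the outer (memory-layer) loops
NR, MR = 2, 3          # micro-kernel tile: MR rows by NR columns


def micro_kernel(A, B, C, i0, i1, j0, j1, p0, p1):
    """Update tile C[i0:i1][j0:j1] with A[i0:i1][p0:p1] times B[p0:p1][j0:j1]."""
    for p in range(p0, p1):            # a sequence of rank-1 updates
        for i in range(i0, i1):
            a = A[i][p]
            for j in range(j0, j1):
                C[i][j] += a * B[p][j]


def blis_like_gemm(A, B, C):
    """Compute C += A @ B with five loops around a micro-kernel."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, NC):                  # loop 5: column panels of B and C
        j_end = min(jc + NC, n)
        for pc in range(0, k, KC):              # loop 4: slices of the k dimension
            p_end = min(pc + KC, k)
            for ic in range(0, m, MC):          # loop 3: row blocks of A and C
                i_end = min(ic + MC, m)
                # Loops 2 and 1 are the two additional loops that replace
                # the monolithic GotoBLAS inner kernel.
                for jr in range(jc, j_end, NR):         # loop 2
                    for ir in range(ic, i_end, MR):     # loop 1
                        micro_kernel(A, B, C,
                                     ir, min(ir + MR, i_end),
                                     jr, min(jr + NR, j_end),
                                     pc, p_end)
    return C


def elemental_owner(i, j, grid_rows, grid_cols):
    """Assumed owner of element (i, j): cyclic distribution with block size 1."""
    return (i % grid_rows, j % grid_cols)


def elements_per_node(m, n, grid_rows, grid_cols):
    """Count how many elements of an m x n matrix each node would own."""
    counts = {}
    for i in range(m):
        for j in range(n):
            node = elemental_owner(i, j, grid_rows, grid_cols)
            counts[node] = counts.get(node, 0) + 1
    return counts
```

Because the distribution assigns single elements rather than blocks, the block sizes used inside `blis_like_gemm` can be chosen independently of how the matrix is stored across nodes, which is the decoupling the abstract describes. When the matrix dimensions are multiples of the grid dimensions, `elements_per_node` returns the same count for every node.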

URI

https://etd.iisc.ac.in/handle/2005/5800

Collections

Department of Computational and Data Sciences (CDS)