Algorithm-Architecture Co-Design for Dense Linear Algebra Computations

Merchant, Farhad

dc.contributor.advisor	Nandy, S K
dc.contributor.author	Merchant, Farhad
dc.date.accessioned	2018-08-13T13:21:15Z
dc.date.accessioned	2018-08-28T09:14:15Z
dc.date.available	2018-08-13T13:21:15Z
dc.date.available	2018-08-28T09:14:15Z
dc.date.issued	2018-08-13
dc.date.submitted	2015
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/3958
dc.identifier.abstract	https://etd.iisc.ac.in/static/etd/abstracts/4831/G27243-Abs.pdf	en_US
dc.description.abstract	Achieving high computation efficiency, in terms of Cycles per Instruction (CPI), for high-performance computing kernels is an interesting and challenging research area. Dense Linear Algebra (DLA) computation is a representative high-performance computing application, which is used, for example, in LU and QR factorizations. Unfortunately, modern off-the-shelf microprocessors fall significantly short of achieving the theoretical lower bound in CPI for high-performance computing applications. In this thesis, we perform an in-depth analysis of the available parallelism and propose suitable algorithmic and architectural variations to significantly improve the computation efficiency. There are two standard approaches for improving the computation efficiency: first, application-specific architecture customization and, second, algorithmic tuning. In the same manner, we first perform a graph-based analysis of selected DLA kernels. From the various forms of parallelism thus identified, we design a custom processing element for improving the CPI. The processing elements are used as building blocks for a commercially available Coarse-Grained Reconfigurable Architecture (CGRA). By performing detailed experiments on a synthesized CGRA implementation, we demonstrate that our proposed algorithmic and architectural variations are able to achieve lower CPI compared to off-the-shelf microprocessors. We also benchmark against state-of-the-art custom implementations to report higher energy-performance-area product.
DLA computations are encountered in many engineering and scientific computing applications ranging from Computational Fluid Dynamics (CFD) to eigenvalue problems. Traditionally, these applications are written using highly tuned High Performance Computing (HPC) software packages like Linear Algebra Package (LAPACK) and/or Scalable Linear Algebra Package (ScaLAPACK). The basic building block for these packages is Basic Linear Algebra Subprograms (BLAS). Algorithms in LAPACK/ScaLAPACK are written in terms of BLAS to achieve high throughput. Despite extensive intellectual effort in the development and tuning of these packages, there is still scope for further tuning. In this thesis, we revisit the most prominent and widely used compute-bound algorithms, such as GMM, for further exploitation of Instruction Level Parallelism (ILP). We further look into LU and QR factorizations for generalizations and exhibit higher ILP in these algorithms. We first accelerate the sequential performance of the algorithms in BLAS and LAPACK and then focus on the parallel realization of these algorithms. Major contributions in algorithmic tuning in this thesis are as follows:
Algorithms:
We present a graph-based analysis of General Matrix Multiplication (GMM) and discuss the different types of parallelism available in GMM.
We present an analysis of Givens Rotation (GR) based QR factorization, where we improve GR and derive Column-wise GR (CGR), which can annihilate multiple elements of a column of a matrix simultaneously. We show that CGR requires fewer multiplications than GR.
We generalize CGR further and derive Generalized GR (GGR), which can annihilate multiple elements of the columns of a matrix simultaneously. We show that the parallelism exhibited by GGR is much higher than that of GR and the Householder Transform (HT).
We extend the generalizations to Square root Free GR (also known as Fast Givens Rotation) and Square root and Division Free GR (SDFG), and derive Column-wise Fast Givens and Column-wise SDFG. We also extend the generalization to complex matrices and derive Complex Column-wise Givens Rotation.
Coarse-grained Reconfigurable Architectures (CGRAs) have gained popularity in the last decade due to their power and area efficiency. Furthermore, CGRAs like REDEFINE also support domain customization. REDEFINE is an array of Tiles where each Tile consists of a Compute Element and a Router. The Routers are responsible for on-chip communication, while the Compute Elements in REDEFINE can be domain customized to accelerate applications pertaining to the domain of interest. In this thesis, we take the REDEFINE base architecture as a starting point and design a Processing Element (PE) that can execute algorithms in BLAS and LAPACK efficiently. We perform several architectural enhancements in the PE to approach the lower bound of the CPI. For parallel realization of BLAS and LAPACK, we attach this PE to the Router of REDEFINE. We achieve better area and power performance compared to earlier customized architectures for DLA. Major contributions in architecture in this thesis are as follows:
Architecture:
We present the design of a PE for acceleration of GMM, which is a Level-3 BLAS operation.
We methodically enhance the PE with different features to improve the performance of GMM.
For efficient realization of Linear Algebra Package (LAPACK), we use a PE that can efficiently execute GMM and show better performance.
For further acceleration of LU and QR factorizations in LAPACK, we identify macro operations encountered in LU and QR factorizations and realize them on a reconfigurable data-path, resulting in 25-30% lower run-time.	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	G27243	en_US
dc.subject	Dense Linear Algebra (DLA)	en_US
dc.subject	Algorithms-Applications	en_US
dc.subject	Parallelising	en_US
dc.subject	Algorithmic Structural Variations	en_US
dc.subject	Dense Linear Algebra	en_US
dc.subject	General Matrix Multiplication (GMM)	en_US
dc.subject	Column-wise Givens Rotation (CGR)	en_US
dc.subject	QR Factorization	en_US
dc.subject	Basic Linear Algebra Subprograms (BLAS)	en_US
dc.subject	Linear Algebra Package (LAPACK)	en_US
dc.subject	Coarse-grained Reconfigurable Architectures (CGRAs)	en_US
dc.subject.classification	Electronic Systems Engineering	en_US
dc.title	Algorithm-Architecture Co-Design for Dense Linear Algebra Computations	en_US
dc.type	Thesis	en_US
dc.degree.name	PhD	en_US
dc.degree.level	Doctoral	en_US
dc.degree.discipline	Faculty of Engineering	en_US

Files in this item

Name: G27243.pdf
Size: 3.501Mb
Format: PDF

This item appears in the following Collection(s)

Electronic Systems Engineering (ESE) [189]

Related items

Hardware Consolidation Of Systolic Algorithms On A Coarse Grained Runtime Reconfigurable Architecture

Hardware-Software Co-Design Accelerators for Sparse BLAS

Numerical Analysis of Some Preconditioners and Associated Error Estimators for Solving Linear Systems