Hardware Consolidation Of Systolic Algorithms On A Coarse Grained Runtime Reconfigurable Architecture
MetadataShow full item record
Application domains such as Bio-informatics, DSP, Structural Biology, Fluid Dynamics, high resolution direction finding, state estimation, adaptive noise cancellation etc. demand high performance computing solutions for their simulation environments. The core computations of these applications are in Numerical Linear Algebra (NLA) kernels. Direct solvers are predominantly required in the domains like DSP, estimation algorithms like Kalman Filter etc, where the matrices on which operations need to be performed are either small or medium sized, but dense. Faddeev's Algorithm is often used for solving dense linear system of equations. Modified Faddeev's algorithm (MFA) is a general algorithm on which LU decomposition, QR factorization or SVD of matrices can be realized. MFA has the good property of realizing a host of matrix operations by computing the Schur complements on four blocked matrices, thereby reducing the overall computation requirements. We will use MFA as a representative Direct Solver in this work. We further discuss Given's rotation based QR algorithm for Decomposition of any matrix, often used to solve the linear least square problem. Systolic Array Architectures are widely accepted ASIC solutions for NLA algorithms. But the \can of worms" associated with this traditional solution spawns the need for alternative solutions. While popular custom hardware solution in form of systolic arrays can deliver high performance, but because of their rigid structure they are not scalable and reconfigurable, and hence not commercially viable. We show how a Reconfigurable computing platform can serve to contain the \can of worms". REDEFINE, a coarse grained runtime reconfigurable architecture has been used for systolic actualization of NLA kernels. We elaborate upon streaming NLA-specific enhancements to REDEFINE in order to meet expected performance goals. We explore the need for an algorithm aware custom compilation framework. We bring about a proposition to realize Faddeev's Algorithm on REDEFINE. We show that REDEFINE performs several times faster than traditional GPPs. Further we direct our interest to QR Decomposition to be the next NLA kernel as it ensures better stability than LU and other decompositions. We use QR Decomposition as a case study to explore the design space of the proposed solution on REDEFINE. We also investigate the architectural details of the Custom Functional Units (CFU) for these NLA kernels. We determine the right size of the sub-array in accordance with the optimal pipeline depth of the core execution units and the number of such units to be used per sub-array. The framework used to realize QR Decomposition can be generalized for the realization of other algorithms dealing with decompositions like LU, Faddeev's Algorithm, Gauss-Jordon etc with different CFU definitions .