Reconfigurable Accelerator for High Performance Application Kernels
Abstract
Accelerating high performance computing (HPC) applications such as dense linear algebra solvers, mesh computations, and stencil computations requires exploiting the parallelism resident in loops. Typically, these loops have simple structures and form the so-called static control parts of HPC applications. In this context, affine loops are of particular interest because of their amenability to automatic parallelization.
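A minimal sketch in C of such an affine loop nest (a 3-point stencil; the function name and arrays are illustrative, not drawn from the thesis). Both the loop bounds and the array subscripts are affine functions of the loop counter, which is what makes dependence analysis and automatic parallelization tractable:

    /* 1-D, 3-point stencil: an affine loop typical of static control parts.
       Bounds (1, n-1) and subscripts (i-1, i, i+1) are affine in i. */
    void stencil3(int n, const float *a, float *b) {
        for (int i = 1; i < n - 1; i++)
            b[i] = 0.33f * (a[i - 1] + a[i] + a[i + 1]);
    }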
Efficient execution of such applications demands computing platforms capable of exploiting parallelism of various granularities. In past decades, architects and hardware designers have exploited the exponential growth in device density on silicon to meet the ever increasing demand for parallel execution in hardware. As we enter the deep sub-micron era, various factors, including the so-called power wall, will impede the traditional approach to architecture design. It will no longer be beneficial to design homogeneous multicores in which each core is structurally and functionally identical. To overcome the challenges of the future, a heterogeneous design philosophy has to be adopted.
We already see some reflection of this in the state of the art: application-specific on-chip accelerators and specialized processing platforms such as graphics processing units have become common in the present generation of computing platforms. Compared to general purpose processors (GPPs), application-specific accelerators offer dramatically higher efficiency for their target applications, but they are inflexible and perform poorly on other applications. Graphics processing units (GPUs) can accelerate a wide range of parallel applications; however, GPUs consume a great deal of energy.
Field programmable gate arrays (FPGAs) may be used to generate accelerators on demand. Although this mitigates the flexibility issue of specialized hardware accelerators, the fine granularity of the lookup tables (LUTs) in FPGAs leads to significantly high configuration time and low operating frequency. Coarse-grain reconfigurable architecture (CGRA) accelerators, consisting of a pool of compute elements (CEs) interconnected through a communication infrastructure, overcome the reconfiguration overheads of FPGAs while providing performance close to that of specialized hardware accelerators.
Examples of CGRAs include the Convey Hybrid-Core Computer, DRRA, and REDEFINE. Modern CGRAs are fundamentally fine-grain instruction processing engines. In order to avoid the overheads of fine-grain instruction processing and the flexibility issue of application-specific accelerators, reconfigurable function units have been designed to act as accelerators that work in a tightly coupled manner with their host GPP. These so-called reconfigurable function units are essentially a number of compute units, such as ALUs or FPUs, interconnected through a programmable interconnect. Examples of such accelerators include DySER and CRFU.
Such reconfigurable accelerators are particularly suitable for accelerating loops with large bounds because the large number of loop iterations effectively amortizes the configuration overheads. These accelerators decouple control and data movement from the core computation, much like the decoupled access-execute architectures of yore. This decoupled execution style allows dedicated hardware resources for computation and control, thereby improving the efficiency of the computation resources.
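A conceptual C sketch of the decoupled access-execute split (the kernel and names are assumptions for exposition, not from the thesis). In a decoupled machine the two streams run concurrently and communicate through hardware queues; here they are simply shown side by side within one loop:

    /* Conceptual split of y[i] = a[i]*x[i] + y[i] into decoupled streams. */
    void dae_kernel(int n, const float *a, const float *x, float *y) {
        for (int i = 0; i < n; i++) {
            /* access stream: address generation and loads into operand queues */
            float ai = a[i], xi = x[i], yi = y[i];
            /* execute stream: pure arithmetic on queued operands */
            float r = ai * xi + yi;
            /* access stream: drains the result queue back to memory */
            y[i] = r;
        }
    }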
However, since the accelerators lack dedicated micro-architectural support for control and data movement, the pipeline of the host GPP has to act as the control hardware. Micro-architectural limitations of the host GPP, such as the limited storage and read/write bandwidth of its register files, affect the performance of these accelerators.
To avoid this pitfall, we propose a reconfigurable accelerator called HyperCell, which is inspired by the decoupled execution style and is supported by dedicated control hardware and temporary operand storage. A HyperCell can be configured once and can then execute a large number of iterations of a loop without direct intervention from the host, which reduces control overhead significantly. The dedicated operand storage enables temporal reuse of data and reduces data-movement overhead. The reconfigurable datapath of HyperCell allows exploitation of fine-grain instruction level parallelism (ILP), while the controller enables pipelined execution of successive iterations of the loop and thereby exploitation of fine-grain data level parallelism (DLP).
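To make the two granularities concrete, consider a hypothetical loop body in C (the kernel and names are illustrative, not taken from the thesis), annotated with where each form of parallelism arises:

    /* Hypothetical loop body annotated with the parallelism a HyperCell exploits. */
    void kernel(int n, const float *a, const float *b,
                const float *c, const float *d, float *out) {
        for (int i = 0; i < n; i++) {
            float t0 = a[i] * b[i];   /* t0 and t1 are independent, so     */
            float t1 = c[i] + d[i];   /* they can issue in parallel (ILP)  */
            out[i] = t0 * t1;         /* consumes t0 and t1                */
            /* Across iterations: while iteration i computes out[i], the
               pipelined datapath can already start t0 and t1 of iteration
               i+1 (fine-grain DLP). */
        }
    }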
In order to exploit a higher degree of parallelism, we connect a multitude of HyperCells using a scalable network on chip. The architecture of the network of HyperCells is based on the REDEFINE archetype; hence we refer to the proposed accelerator as the REDEFINE HyperCell Multicore (RHyMe). The network of HyperCells forms the compute fabric of RHyMe. The HyperCells are capable of concurrently executing partitions (so-called tiles) of the iteration space of a loop. The multiplicity of HyperCells enables exploitation of coarse-grain DLP.
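As an illustration of such tiling (the kernel, tile size, and dispatch model are assumptions for exposition, not the thesis's actual scheme), a 2-D iteration space can be partitioned into independent tiles, each of which could be dispatched to a different HyperCell:

    /* Partition an n x n iteration space into TILE x TILE tiles; each
       independent (ti, tj) tile could run on its own HyperCell, giving
       coarse-grain DLP. TILE = 32 is a hypothetical tile size. */
    #define TILE 32
    void tiled_add(int n, float **A, float **B, float **C) {
        for (int ti = 0; ti < n; ti += TILE)
            for (int tj = 0; tj < n; tj += TILE)   /* one tile per HyperCell */
                for (int i = ti; i < ti + TILE && i < n; i++)
                    for (int j = tj; j < tj + TILE && j < n; j++)
                        C[i][j] = A[i][j] + B[i][j];
    }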
In order to reduce data-movement overheads between the accelerator and its host processor, we introduce a distributed shared memory that the compute fabric can utilize as its operand store. Control overheads are minimized through the introduction of a dedicated orchestrator module that governs the execution of tiles of an iteration space on the compute fabric of RHyMe. Through experimental results, we quantitatively demonstrate that our proposed accelerator incurs minimal control overhead in terms of performance, hardware complexity of the dedicated orchestrator, and energy consumption. For certain kernels, we observe that the ratio of computation time to total execution time is more than 99%, while the configuration and synchronization overhead is less than 1%. We also demonstrate that the accelerator is capable of exploiting multi-grain parallelism and temporal reuse of operand data. We measured the data-movement overhead for different kernels; for kernels with greater scope for temporal reuse of data, our proposed hardware effectively hides data-movement latencies. We also measured the relative costs of computation, control, and data movement in terms of energy expenditure, and we observe that for kernels with significant scope for data reuse, more than 80% of the energy is spent in computation. We achieve performance ranging from 8.24 to 20.64 GFLOPS across the various kernels. By effectively exploiting parallelism and temporal reuse of data, RHyMe achieves a power efficiency of up to 16.86 GFLOPS/Watt.