Show simple item record

dc.contributor.advisor: Nandy, S K
dc.contributor.advisor: Raha, Soumyendu
dc.contributor.author: Das, Saptarsi
dc.date.accessioned: 2021-09-21T06:37:57Z
dc.date.available: 2021-09-21T06:37:57Z
dc.date.submitted: 2018
dc.identifier.uri: https://etd.iisc.ac.in/handle/2005/5313
dc.description.abstract: Accelerating high performance computing (HPC) applications such as dense linear algebra solvers, mesh computations, and stencil computations requires exploiting the parallelism resident in loops. Typically these loops have simple structures and form the so-called static control parts of HPC applications. In this context, affine loops are of particular interest because of their amenability to automatic parallelization. Efficient execution of such applications demands computing platforms capable of exploiting parallelism of various granularities. In the past decades, architects and hardware designers have exploited the exponential growth in device density on silicon to meet the ever increasing demand for parallel execution in hardware. As we enter the deep sub-micron era, various factors, including the so-called power wall, will impede the traditional approach to architecture design. It will no longer be beneficial to design homogeneous multicores where each core is structurally and functionally identical. To overcome the challenges of the future, a heterogeneous design philosophy has to be adopted. We already see some reflection of this in the state of the art: application-specific on-chip accelerators and specialized processing platforms such as graphics processing units have become common in the present generation of computing platforms. Although application-specific accelerators offer dramatically higher efficiency than general purpose processors (GPP) for their target applications, they are not as flexible and perform poorly on other applications. Graphics processing units (GPU) can be used for accelerating a wide range of parallel applications; however, GPUs consume large amounts of energy. Field programmable gate arrays (FPGA) may be used to generate accelerators on demand.
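To make the notion of an affine loop concrete, the following minimal C sketch (an illustration for this record, not code from the thesis) shows a static control part: every loop bound and every array subscript is an affine function of the enclosing loop indices, which is the property that makes such nests amenable to automatic dependence analysis and parallelization.

```c
#define N 4

/* Matrix multiply: a canonical affine loop nest.
 * Bounds (0 <= i,j,k < N) and subscripts (C[i][j], A[i][k], B[k][j])
 * are all affine functions of the loop indices. */
void matmul(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}
```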
Although this mitigates the flexibility issue of specialized hardware accelerators, the finer granularity of the lookup tables (LUT) in FPGAs leads to significantly high configuration time and low operating frequency. Coarse-grain reconfigurable architecture (CGRA) accelerators, consisting of a pool of compute elements (CE) interconnected through some communication infrastructure, overcome the reconfiguration overheads of FPGAs while providing performance close to that of specialized hardware accelerators. Examples of CGRAs include the Convey Hybrid-Core Computer, DRRA, and REDEFINE. Modern CGRAs are fundamentally fine-grain instruction processing engines. To avoid both the overheads of fine-grain instruction processing and the inflexibility of application-specific accelerators, reconfigurable function units have been designed to act as accelerators that work in a tightly coupled manner with their host GPP. These reconfigurable function units are essentially a number of compute units, such as ALUs or FPUs, interconnected using a programmable interconnect. Examples of such accelerators include DySER and CRFU. Such reconfigurable accelerators are particularly suitable for accelerating loops with large bounds because the large number of loop iterations effectively amortizes the configuration overheads. The reconfigurable accelerators decouple control and data movement from the core computation, much like the decoupled access-execute architectures of yore. This decoupled execution style allows dedicated hardware resources for computation and control, thereby improving the efficiency of the computation resources. However, since the accelerators lack dedicated micro-architectural support for control and data movement, the pipeline of the host GPP has to act as the control hardware. Micro-architectural limitations of the host GPP, such as the limited storage and read/write bandwidth of register files, affect the performance of these accelerators.
To avoid this pitfall, we propose a reconfigurable accelerator called HyperCell, which is inspired by the decoupled execution style and is supported by dedicated control hardware and temporary operand storage. A HyperCell can be configured once and then execute a large number of iterations of a loop without direct intervention from the host, which reduces control overhead significantly. The dedicated operand storage enables temporal reuse of data and reduces data movement overhead. The reconfigurable datapath of HyperCell allows exploitation of fine-grain instruction level parallelism (ILP), while the controller enables pipelined execution of successive loop iterations and thereby exploitation of fine-grain data level parallelism (DLP). To exploit a higher degree of parallelism, we connect a multitude of HyperCells using a scalable network on chip. The architecture of the network of HyperCells is based on the REDEFINE archetype; hence we refer to the proposed accelerator as the REDEFINE HyperCell Multicore (RHyMe). The network of HyperCells forms the compute fabric of RHyMe. The HyperCells are capable of concurrently executing partitions (so-called tiles) of a loop's iteration space, and this multiplicity of HyperCells enables exploitation of coarse-grain DLP. To reduce data movement overheads between the accelerator and its host processor, we introduce a distributed shared memory that the compute fabric can utilize as its operand store. Control overheads are minimized through the introduction of a dedicated orchestrator module that governs execution of the tiles of an iteration space on the compute fabric of RHyMe. Through experimental results we quantitatively demonstrate that our proposed accelerator incurs minimal control overhead in terms of performance, hardware complexity of the dedicated orchestrator, and energy consumption.
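The tiling of an iteration space described above can be sketched in plain C as follows. This is a hypothetical illustration only (the tile size, names, and dispatch model are not RHyMe's actual interface): the outer loop enumerates tiles, each of which could be handed to a separate HyperCell for coarse-grain DLP, while the inner loop is the fine-grain work pipelined within one HyperCell.

```c
#define M 16     /* total iterations in the (1-D) iteration space */
#define TILE 4   /* illustrative tile size: M/TILE tiles in all   */

/* SAXPY with a tiled iteration space: y[i] = a*x[i] + y[i].
 * Each value of t identifies one tile of TILE consecutive iterations;
 * distinct tiles are independent and could run on distinct HyperCells. */
void saxpy_tiled(float a, const float *x, float *y) {
    for (int t = 0; t < M; t += TILE) {       /* one tile per compute element */
        for (int i = t; i < t + TILE; i++) {  /* fine-grain work within a tile */
            y[i] = a * x[i] + y[i];
        }
    }
}
```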
For certain kernels, we observe that the ratio of computation time to total execution time is more than 99%; the configuration and synchronization overhead is less than 1%. We also demonstrate that the accelerator is capable of exploiting multi-grain parallelism and temporal reuse of operand data. We measured the data-movement overhead for different kernels: for kernels with greater scope for temporal reuse of data, our proposed hardware effectively hides data movement latencies. We also measured the relative cost of computation, control, and data movement in terms of energy expenditure, and observe that kernels with significant scope for data reuse spend more than 80% of their energy in computation. We achieve performance ranging from 8.24 to 20.64 GFLOPS for the various kernels. By effectively exploiting parallelism and temporal reuse of data, RHyMe achieves power efficiency of up to 16.86 GFLOPS/Watt.
dc.language.iso: en_US
dc.relation.ispartofseries: ;G29372
dc.rights: I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
dc.subject: high performance computing
dc.subject: parallelism
dc.subject: HyperCell
dc.subject: RHyMe
dc.subject.classification: Research Subject Categories::TECHNOLOGY::Information technology::Computer science
dc.title: Reconfigurable Accelerator for High Performance Application Kernels
dc.type: Thesis
dc.degree.name: PhD
dc.degree.level: Doctoral
dc.degree.grantor: Indian Institute of Science
dc.degree.discipline: Engineering

