Reconfigurable Accelerator for High Performance Application Kernels
Abstract
Accelerating high performance computing (HPC) applications such as dense linear algebra solvers, mesh computations, and stencil computations requires exploiting the parallelism resident in loops. Typically, these loops have simple structures and form the so-called static control parts of HPC applications. In this context, affine loops are of particular interest because of their amenability to automatic parallelization.
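A minimal sketch in C of such an affine loop nest (a 3-point stencil; the function name and arrays are illustrative, not drawn from the thesis). Both the loop bounds and the array subscripts are affine functions of the loop counter, which is what makes dependence analysis and automatic parallelization tractable:

    /* 1-D, 3-point stencil: an affine loop typical of static control parts.
       Bounds (1, n-1) and subscripts (i-1, i, i+1) are affine in i. */
    void stencil3(int n, const float *a, float *b) {
        for (int i = 1; i < n - 1; i++)
            b[i] = 0.33f * (a[i - 1] + a[i] + a[i + 1]);
    }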
Efficient execution of such applications demands computing platforms capable of exploiting parallelism of various granularities. In past decades, architects and hardware designers have exploited the exponential growth in device density on silicon to meet the ever increasing demand for parallel execution in hardware. As we enter the deep sub-micron era, various factors, including the so-called power wall, will impede the traditional approach to architecture design. It will no longer be beneficial to design homogeneous multicores in which each core is structurally and functionally identical. To overcome the challenges of the future, a heterogeneous design philosophy has to be adopted.
We already see some reflection of this in the state of the art: application-specific on-chip accelerators and specialized processing platforms such as graphics processing units have become common in the present generation of computing platforms. Compared to general purpose processors (GPPs), application-specific accelerators offer dramatically higher efficiency for their target applications, but they are inflexible and perform poorly on other applications. Graphics processing units (GPUs) can accelerate a wide range of parallel applications; however, GPUs consume a great deal of energy.
Field programmable gate arrays (FPGAs) may be used to generate accelerators on demand. Although this mitigates the flexibility issue of specialized hardware accelerators, the fine granularity of the lookup tables (LUTs) in FPGAs leads to significantly high configuration time and low operating frequency. Coarse-grain reconfigurable architecture (CGRA) accelerators, consisting of a pool of compute elements (CEs) interconnected through a communication infrastructure, overcome the reconfiguration overheads of FPGAs while providing performance close to that of specialized hardware accelerators.
Examples of CGRAs include the Convey Hybrid-Core Computer, DRRA, and REDEFINE. Modern CGRAs are fundamentally fine-grain instruction processing engines. In order to avoid the overheads of fine-grain instruction processing and the flexibility issue of application-specific accelerators, reconfigurable function units have been designed to act as accelerators that work in a tightly coupled manner with their host GPP. These so-called reconfigurable function units are essentially a number of compute units, such as ALUs or FPUs, interconnected through a programmable interconnect. Examples of such accelerators include DySER and CRFU.
Such reconfigurable accelerators are particularly suitable for accelerating loops with large bounds because the large number of loop iterations effectively amortizes the configuration overheads. These accelerators decouple control and data movement from the core computation, much like the decoupled access-execute architectures of yore. This decoupled execution style allows dedicated hardware resources for computation and control, thereby improving the efficiency of the computation resources.
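A conceptual C sketch of the decoupled access-execute split (the kernel and names are assumptions for exposition, not from the thesis). In a decoupled machine the two streams run concurrently and communicate through hardware queues; here they are simply shown side by side within one loop:

    /* Conceptual split of y[i] = a[i]*x[i] + y[i] into decoupled streams. */
    void dae_kernel(int n, const float *a, const float *x, float *y) {
        for (int i = 0; i < n; i++) {
            /* access stream: address generation and loads into operand queues */
            float ai = a[i], xi = x[i], yi = y[i];
            /* execute stream: pure arithmetic on queued operands */
            float r = ai * xi + yi;
            /* access stream: drains the result queue back to memory */
            y[i] = r;
        }
    }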
However, since the accelerators lack dedicated micro-architectural support for control and data movement, the pipeline of the host GPP has to act as the control hardware. Micro-architectural limitations of the host GPP, such as the limited storage and read/write bandwidth of its register files, affect the performance of these accelerators.
To avoid this pitfall, we propose a reconfigurable accelerator called HyperCell, which is inspired by the decoupled execution style and is supported by dedicated control hardware and temporary operand storage. A HyperCell can be configured once and can then execute a large number of iterations of a loop without direct intervention from the host, which reduces control overhead significantly. The dedicated operand storage enables temporal reuse of data and reduces data-movement overhead. The reconfigurable datapath of HyperCell allows exploitation of fine-grain instruction level parallelism (ILP), while the controller enables pipelined execution of successive iterations of the loop and thereby exploitation of fine-grain data level parallelism (DLP).
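To make the two granularities concrete, consider a hypothetical loop body in C (the kernel and names are illustrative, not taken from the thesis), annotated with where each form of parallelism arises:

    /* Hypothetical loop body annotated with the parallelism a HyperCell exploits. */
    void kernel(int n, const float *a, const float *b,
                const float *c, const float *d, float *out) {
        for (int i = 0; i < n; i++) {
            float t0 = a[i] * b[i];   /* t0 and t1 are independent, so     */
            float t1 = c[i] + d[i];   /* they can issue in parallel (ILP)  */
            out[i] = t0 * t1;         /* consumes t0 and t1                */
            /* Across iterations: while iteration i computes out[i], the
               pipelined datapath can already start t0 and t1 of iteration
               i+1 (fine-grain DLP). */
        }
    }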
In order to exploit a higher degree of parallelism, we connect a multitude of HyperCells using a scalable network on chip. The architecture of the network of HyperCells is based on the REDEFINE archetype; hence we refer to the proposed accelerator as the REDEFINE HyperCell Multicore (RHyMe). The network of HyperCells forms the compute fabric of RHyMe. The HyperCells are capable of concurrently executing partitions (so-called tiles) of the iteration space of a loop. The multiplicity of HyperCells enables exploitation of coarse-grain DLP.
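As an illustration of such tiling (the kernel, tile size, and dispatch model are assumptions for exposition, not the thesis's actual scheme), a 2-D iteration space can be partitioned into independent tiles, each of which could be dispatched to a different HyperCell:

    /* Partition an n x n iteration space into TILE x TILE tiles; each
       independent (ti, tj) tile could run on its own HyperCell, giving
       coarse-grain DLP. TILE = 32 is a hypothetical tile size. */
    #define TILE 32
    void tiled_add(int n, float **A, float **B, float **C) {
        for (int ti = 0; ti < n; ti += TILE)
            for (int tj = 0; tj < n; tj += TILE)   /* one tile per HyperCell */
                for (int i = ti; i < ti + TILE && i < n; i++)
                    for (int j = tj; j < tj + TILE && j < n; j++)
                        C[i][j] = A[i][j] + B[i][j];
    }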
In order to reduce data-movement overheads between the accelerator and its host processor, we introduce a distributed shared memory that the compute fabric can utilize as its operand store. Control overheads are minimized through the introduction of a dedicated orchestrator module that governs the execution of tiles of an iteration space on the compute fabric of RHyMe. Through experimental results, we quantitatively demonstrate that our proposed accelerator incurs minimal control overhead in terms of performance, hardware complexity of the dedicated orchestrator, and energy consumption. For certain kernels, we observe that the ratio of computation time to total execution time is more than 99%, while the configuration and synchronization overhead is less than 1%. We also demonstrate that the accelerator is capable of exploiting multi-grain parallelism and temporal reuse of operand data. We measured the data-movement overhead for different kernels; for kernels with greater scope for temporal reuse of data, our proposed hardware effectively hides data-movement latencies. We also measured the relative costs of computation, control, and data movement in terms of energy expenditure, and we observe that for kernels with significant scope for data reuse, more than 80% of the energy is spent in computation. We achieve performance ranging from 8.24 to 20.64 GFLOPS across the various kernels. By effectively exploiting parallelism and temporal reuse of data, RHyMe achieves a power efficiency of up to 16.86 GFLOPS/Watt.