dc.description.abstract | GPGPUs have emerged as high-performance computing platforms and are used to boost the performance of general, non-graphics applications from various scientific domains. These applications span areas such as social networks, defense, bioinformatics, data science, medical imaging, fluid dynamics, etc. [3]. To exploit the computing potential of these accelerators efficiently, an application must be mapped well to the underlying architecture. As a result, applications running on GPGPUs exhibit a wide range of behaviors and are characterized as regular or irregular accordingly. Regular applications typically operate on array-like data structures whose run-time behavior can be predicted statically, whereas irregular applications operate on pointer-based data structures such as graphs and trees [2]. Irregular applications are generally characterized by a high degree of data-dependent control flow and memory accesses. The literature contains various efforts to characterize such applications, particularly the irregular ones whose behavior leads to run-time bottlenecks. Burtscher et al. [2] investigated various irregular GPGPU applications by quantifying control-flow and memory-access behaviors on a real GPU device. Molly et al. [4] analyzed the performance impact of these behaviors on a cycle-accurate GPU simulator [1]. Xu et al. [5] studied the execution characteristics of graph-based applications on GPGPUs. All of these works characterized divergence at the kernel level but not at the thread level. In this work, we provide an in-depth characterization of three divergences arising from 1) workload distribution, 2) memory access, and 3) control-flow behavior at different levels of the GPU thread hierarchy, with the purpose of analyzing and quantifying divergence characteristics at the warp, thread-block, and kernel level.
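As a minimal, hypothetical sketch (the kernel, its name, and the CSR arrays row_ptr and col_idx are illustrative assumptions, not code from the benchmarks studied here), the following CUDA kernel shows how all three divergence sources can arise together: per-thread work depends on a vertex's degree (workload divergence), neighbor loads are data-dependent and scattered (memory divergence), and the relaxation branch is data-dependent (control-flow divergence).

    // Hypothetical illustration only; not taken from the evaluated benchmark suites.
    __global__ void relax_neighbors(const int *row_ptr, const int *col_idx,
                                    const int *dist, int *new_dist, int n)
    {
        int v = blockIdx.x * blockDim.x + threadIdx.x;  // one vertex per thread
        if (v >= n) return;

        int begin = row_ptr[v];      // coalesced: adjacent threads read adjacent words
        int end   = row_ptr[v + 1];

        // Workload divergence: lanes of the same warp iterate a different
        // number of times because vertex degrees differ.
        for (int e = begin; e < end; ++e) {
            int u    = col_idx[e];   // data-dependent index
            int cand = dist[u] + 1;  // scattered load with little intra-warp spatial locality

            // Control-flow divergence: the branch outcome depends on the data,
            // so lanes of one warp may follow different paths.
            if (cand < new_dist[v])
                new_dist[v] = cand;
        }
    }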
In Chapter 1, we review certain aspects of CPUs and GPUs and how they differ from each other, and then discuss various characteristics of GPGPU applications. In Chapter 2, we provide background on GPU architecture, the CUDA programming model, and the GPU's SIMD execution model. We briefly explain key CUDA programming concepts such as the GPU thread hierarchy and the different addressable memory spaces, and describe the various behaviors that cause divergence across parallel threads. We then review the related work in the context of the divergence studied here, followed by the contributions of this thesis. In Chapter 3, we explain our methodology for quantifying workload and branch divergence across threads at the various levels of thread organization, and then present our characterization methodology for quantifying the divergent aspects of memory instructions. In Chapter 4, we present our chosen benchmarks, taken from various suites, and show the baseline GPGPU-Sim configuration used to evaluate our methodology. We then discuss our characterization results for workload and branch divergence at the warp, thread-block, and kernel level for some interesting application kernels. We examine the divergence behavior of graph-based applications and show how it varies across threads. We present our characterization of the memory-access behavior of irregular applications using an instruction classification based on spatial locality. Finally, we discuss the relationship between throughput and the divergence measures by studying their correlation coefficients.
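The abstract does not name the specific coefficient; assuming the standard Pearson correlation between a throughput metric x (for example, instructions per cycle) and a divergence measure y over the evaluated kernels, it takes the usual form

    r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}},

where r close to -1 would indicate that kernels with higher divergence tend to achieve lower throughput.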
To summarize, we quantified and analyzed the control-flow and workload divergence across threads at the warp, thread-block, and kernel level for a diverse collection of 12 GPGPU applications that exhibit both regular and irregular behavior. Using the threads' hardware-utilization efficiency and a measure we call `average normalized instructions per thread', we quantify branch and workload divergence, respectively.
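The precise definitions appear in the methodology of Chapter 3; as a rough sketch under our own assumptions (not necessarily the thesis's exact formulation), the two measures could be written as

    \text{utilization efficiency} = \frac{\sum_{w} \text{active lanes in warp } w}{\text{warp size} \times \text{warps issued}}, \qquad
    \text{avg. normalized IPT} = \frac{1}{T} \sum_{t=1}^{T} \frac{I_t}{\max_{t'} I_{t'}},

where I_t is the dynamic instruction count of thread t: low utilization efficiency signals branch divergence, while a value of the second measure well below 1 signals workload imbalance across the T threads.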
Our characterization technique for memory divergence classifies memory instructions into four different groups based on the intra-warp spatial locality of the instructions. We then quantify the impact of memory divergence using the behavior of the GPU L1 data cache. | en_US |