    Characterization of Divergence resulting from Workload, Memory and Control-Flow behavior in GPGPU Applications

    Author
    Morkhande, Rahul Raj Kumar
    Abstract
    GPGPUs have emerged as high-performance computing platforms and are used to boost the performance of general non-graphics applications from various scientific domains. These applications span varied areas such as social networks, defense, bioinformatics, data science, medical imaging, fluid dynamics, etc. [3]. To efficiently exploit the computing potential of these accelerators, an application must be well mapped to the underlying architecture. As a result, applications running on GPGPUs exhibit different behavioral characteristics and are characterized as regular or irregular based on their behavior. Regular applications typically operate on array-like data structures whose run-time behavior can be statically predicted, whereas irregular applications operate on pointer-based data structures such as graphs and trees [2]. Irregular applications are generally characterized by a high degree of data-dependent control flow and memory accesses. In the literature, we find various efforts to characterize such applications, particularly the irregular ones, which exhibit behavior that results in run-time bottlenecks. Burtscher et al. [2] investigated various irregular GPGPU applications by quantifying control-flow and memory-access behaviors on a real GPU device. Molly et al. [4] analyzed performance aspects of these behaviors on a cycle-accurate GPU simulator [1]. Qiumin Xu et al. [5] studied the execution characteristics of graph-based applications on GPGPUs. All of these works characterized divergence at the kernel level but not at the thread level. In this work, we provide an in-depth characterization of three divergences resulting from 1) workload distribution, 2) memory access, and 3) control-flow behavior at different levels of the GPU thread hierarchy, with the purpose of analyzing and quantifying divergence characteristics at the warp, thread-block, and kernel level.
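As an illustration of the control-flow divergence described above, the following minimal sketch models how a data-dependent branch splits a warp and lowers SIMD-lane utilization. The 32-bit active masks, the `warp_utilization` function, and the example mask values are hypothetical constructions for exposition; they are not taken from the thesis.

```python
# Hypothetical sketch: SIMD-lane utilization under branch divergence.
# When a warp of 32 threads hits a data-dependent branch, the diverged
# paths are serialized, so some lanes idle on every issued instruction.
# Utilization is the mean fraction of active lanes per instruction slot.

WARP_SIZE = 32

def warp_utilization(active_masks):
    """Mean fraction of active lanes over the instructions a warp issues.

    active_masks: one 32-bit integer per issued instruction; bit i set
    means lane i executed that instruction.
    """
    if not active_masks:
        return 0.0
    total_active = sum(bin(mask).count("1") for mask in active_masks)
    return total_active / (WARP_SIZE * len(active_masks))

# Fully convergent warp: all 32 lanes active for every instruction.
convergent = [0xFFFFFFFF] * 4
# Divergent warp: an if/else splits the warp 20/12; both paths issue.
divergent = [0xFFFFFFFF, 0x000FFFFF, 0xFFF00000, 0xFFFFFFFF]

print(warp_utilization(convergent))  # 1.0
print(warp_utilization(divergent))   # 0.75
```

A regular application keeps utilization near 1.0, while the irregular, data-dependent branching described above drives it down.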
In Chapter 1, we review certain aspects of CPUs and GPUs and how they differ from each other. We then discuss various characteristics of GPGPU applications. In Chapter 2, we provide background on GPU architectures, the CUDA programming model, and the GPU SIMD execution model. We briefly explain key CUDA programming concepts such as the GPU thread hierarchy and the different addressable memory spaces, and describe the various behaviors that cause divergence across parallel threads. We then review the related work in the context of the divergence studied here, followed by this thesis's contributions. In Chapter 3, we explain our methodology for quantifying workload and branch divergence across threads at various levels of the thread organization. We then present our characterization methodology for quantifying the divergent aspects of memory instructions. In Chapter 4, we present our chosen benchmarks, taken from various suites, and show the baseline GPGPU-Sim configuration we used for evaluating our methodology. We then discuss our characterization results for workload and branch divergence at the warp, thread-block, and kernel level for some interesting application kernels. We examine the divergence behavior of graph-based applications and show how it varies across threads. We present our characterization of the memory-access behavior of irregular applications using an instruction classification based on spatial locality. We then discuss the relationship between throughput and the divergence measures by studying their correlation coefficients. To summarize, we quantified and analyzed the control-flow and workload divergence across threads at the warp, thread-block, and kernel level for a diverse collection of 12 GPGPU applications exhibiting both regular and irregular behavior. Using each thread's hardware utilization efficiency and a measure we call 'Average normalized instructions per thread', we quantify branch and workload divergence, respectively.
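A workload-divergence measure in the spirit of the "Average normalized instructions per thread" named above can be sketched as follows. Each thread's dynamic instruction count is normalized by the maximum count in its group (warp, thread block, or kernel), so 1.0 means perfectly balanced work and smaller values indicate skew. The function name, normalization choice, and example counts are assumptions for illustration; the thesis's exact definition may differ.

```python
# Hypothetical sketch of a workload-divergence measure: average of
# per-thread dynamic instruction counts, each normalized by the group
# maximum. A value of 1.0 means every thread did equal work; lower
# values indicate data-dependent workload imbalance.

def avg_normalized_instructions(insn_counts):
    """Average per-thread instruction count normalized by the group max."""
    peak = max(insn_counts)
    if peak == 0:
        return 0.0
    return sum(count / peak for count in insn_counts) / len(insn_counts)

# A balanced (regular) warp vs. a skewed (irregular) one, e.g. a graph
# kernel where per-thread work depends on vertex degree.
balanced = [100] * 32
skewed = [100] + [10] * 31

print(avg_normalized_instructions(balanced))           # 1.0
print(round(avg_normalized_instructions(skewed), 4))   # 0.1281
```

Computing this measure per warp, per thread block, and per kernel mirrors the three levels of the thread hierarchy at which the thesis reports its characterization.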
Our characterization technique for memory divergence classifies memory instructions into four different groups based on the intra-warp spatial locality of the instructions. We then quantify the impact of memory divergence using the behavior of the GPU L1 data cache.
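The idea of grouping memory instructions by intra-warp spatial locality can be sketched as below: the fewer distinct cache lines the 32 lane addresses of one warp-level memory instruction touch, the better the access coalesces. The four bucket names and thresholds here are illustrative assumptions, not the thesis's actual grouping.

```python
# Hypothetical sketch: bucket a warp's memory instruction by how many
# distinct cache lines its 32 lane addresses touch. Fewer lines means
# higher intra-warp spatial locality and better coalescing in the L1
# data cache. Thresholds and labels are illustrative only.

CACHE_LINE = 128  # bytes; a typical L1 line size on NVIDIA GPUs

def classify_access(addresses):
    """Classify one warp-level memory instruction by distinct lines touched."""
    lines_touched = len({addr // CACHE_LINE for addr in addresses})
    if lines_touched == 1:
        return "fully-coalesced"
    if lines_touched <= 4:
        return "high-locality"
    if lines_touched <= 16:
        return "moderate-locality"
    return "divergent"

# 32 consecutive 4-byte loads fit in one 128-byte line: fully coalesced.
print(classify_access([lane * 4 for lane in range(32)]))   # fully-coalesced
# A 128-byte stride gives each lane its own line: worst-case divergence.
print(classify_access([lane * 128 for lane in range(32)])) # divergent
```

A divergent instruction in this scheme issues many L1 transactions for a single warp-level access, which is how its cost surfaces in the L1 data-cache behavior mentioned above.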
    URI
    https://etd.iisc.ac.in/handle/2005/5453
    Collections
    • Department of Computational and Data Sciences (CDS)

    etd@IISc is a joint service of SERC & J R D Tata Memorial (JRDTML) Library || Powered by DSpace software || DuraSpace
    Contact Us | Send Feedback | Thesis Templates