A Divide and Conquer Framework For Graph Processing in Distributed Heterogeneous Systems
In many fields of science and engineering graph data structures are used to represent real-world information. As these graphs scale in size, it becomes very inefficient to process these graphs on a single core CPU using a sequential algorithm. Hence, multi-core and many-core architectures are very common for processing such large datasets. With the growth of accelerator based processors like GPGPU and Xeon-Phi, the baseline architecture has shifted to multi-threaded heterogeneous systems which supercomputers potentially rely on to exploit peak computing power. Large scale graph processing on such heterogeneous architectures (multi-core CPU and GPGPU) often results in under-utilization of the computing resources as these applications mostly have high communication and synchronization overheads. Hence, efficient processing of graphs on such heterogeneous platforms poses interesting and challenging problems. There is an increasing demand for graph processing frameworks which can provide high-level APIs to users to automatically harness the underlying parallel hardware. In order to process large graphs frameworks for multi-node distributed memory platforms is also very common. It is also important to design a framework which can automatically parallelize the graph algorithms in a multi-node multi-device environment. Efficient processing of graph algorithms on heterogeneous CPU-GPU systems requires effectively harnessing the combined power of both the CPU and GPU devices. In this thesis, we first present HyPar a novel framework for graph processing in heterogeneous architectures based on the divide-and-conquer paradigm. The framework partitions a given graph across the devices and performs simultaneous independent computations on both the devices. The framework provides some simple and generic APIs, supported with efficient runtime strategies for hybrid executions. Using experiments with minimum spanning tree, community detection, triangle counting, graph coloring and connected component applications on a heterogeneous system, we show that our HyPar framework provides average performance improvements of 22-74% over the state-of-art, optimized CPU-only and GPU-only implementations of the corresponding applications. When compared to the prevalent BSP approach for multi-device executions of graphs our HyPar framework yields 74-92% average performance improvements. When compared with Ligra, a state-of-art multi-core graph processing framework, we obtain average performance improvements of 24-59%. While a single node heterogeneous framework can help explore medium size graphs it is also important to design an engine which can help explore very large size real-world graphs. Moreover, the large scale HPC systems can also provide substantial bene ts over the single node heterogeneous systems for these real-world data sets. Hence, we extended our HyPar framework to distributed heterogeneous settings. We partition the graph across cluster nodes and further to the devices in a single node using HyPar for independent processing. We then merge the results from the different nodes using a novel hybrid merging algorithm to ensure that the combined results on a node never exceeds it memory capacity. In our experiments, we show that our proposed algorithm shows large-scale improvements over an existing stateof- art approach. We also show that the algorithm exhibits almost linear scalability, and the use of GPUs result in up to 23% improvement in performance over our multi-node CPU-only implementation.