A Divide and Conquer Framework For Graph Processing in Distributed Heterogeneous Systems
Abstract
In many fields of science and engineering graph data structures are used to represent real-world
information. As these graphs scale in size, it becomes very inefficient to process these graphs on
a single core CPU using a sequential algorithm. Hence, multi-core and many-core architectures
are very common for processing such large datasets. With the growth of accelerator based
processors like GPGPU and Xeon-Phi, the baseline architecture has shifted to multi-threaded
heterogeneous systems which supercomputers potentially rely on to exploit peak computing
power.
Large scale graph processing on such heterogeneous architectures (multi-core CPU and
GPGPU) often results in under-utilization of the computing resources as these applications
mostly have high communication and synchronization overheads. Hence, efficient processing
of graphs on such heterogeneous platforms poses interesting and challenging problems. There
is an increasing demand for graph processing frameworks which can provide high-level APIs
to users to automatically harness the underlying parallel hardware. In order to process large
graphs frameworks for multi-node distributed memory platforms is also very common. It is also
important to design a framework which can automatically parallelize the graph algorithms in
a multi-node multi-device environment.
Efficient processing of graph algorithms on heterogeneous CPU-GPU systems requires effectively
harnessing the combined power of both the CPU and GPU devices. In this thesis, we first
present HyPar a novel framework for graph processing in heterogeneous architectures based on
the divide-and-conquer paradigm. The framework partitions a given graph across the devices
and performs simultaneous independent computations on both the devices. The framework
provides some simple and generic APIs, supported with efficient runtime strategies for hybrid
executions. Using experiments with minimum spanning tree, community detection, triangle
counting, graph coloring and connected component applications on a heterogeneous system, we
show that our HyPar framework provides average performance improvements of 22-74% over the
state-of-art, optimized CPU-only and GPU-only implementations of the corresponding applications.
When compared to the prevalent BSP approach for multi-device executions of graphs our HyPar framework yields 74-92% average performance improvements. When compared with
Ligra, a state-of-art multi-core graph processing framework, we obtain average performance
improvements of 24-59%.
While a single node heterogeneous framework can help explore medium size graphs it is
also important to design an engine which can help explore very large size real-world graphs.
Moreover, the large scale HPC systems can also provide substantial bene ts over the single
node heterogeneous systems for these real-world data sets. Hence, we extended our HyPar
framework to distributed heterogeneous settings. We partition the graph across cluster nodes
and further to the devices in a single node using HyPar for independent processing. We then
merge the results from the different nodes using a novel hybrid merging algorithm to ensure
that the combined results on a node never exceeds it memory capacity. In our experiments,
we show that our proposed algorithm shows large-scale improvements over an existing stateof-
art approach. We also show that the algorithm exhibits almost linear scalability, and the
use of GPUs result in up to 23% improvement in performance over our multi-node CPU-only
implementation.