Hyperplane Partitioning : An Approach To Global Data Partitioning For Distributed Memory Machines
Automatic Global Data Partitioning for Distributed Memory Machines (DMMs) is a difficult problem. Distributed memory machines are scalable, but since the memory is distributed across processors, the scheme of placement of data (arrays) onto local memories of different processors become crucial since any communication between processors for non-local data access is an order of magnitude costlier than access to local memory. Researchers have given varied solutions to this problem, most of which work for uniform dependences in loops and they suggest HPF-like distributions only. For non-uniform dependences the loop was made to run sequentially. In this work, we present a partitioning strategy called Hyperplane Partitioning which works well with loops with non-uniform dependences also. In this method of partitioning, the iteration space is partitioned into as many number of partitions as there are number of logical processors, in such a way that the overall inter-processor communication will be minimum. The idea is to localize as many as dependences as possible so that overall communication both beacuse of non-local data as well as inter-processor synchronizations are reduced. These partitions are then induced into data spaces of the arrays referenced in the loop. Each processor then runs its part of iteration space keeping the data partition that it owns locally. Any non-local data access is implemented by inter-processor communication at run-time.The Hyperplane Partitioning is also extended to a sequence of loops. This is done by first finding Best Local Distribution (BLD) for every loop first and then finding the best way of grouping different adjacent loops (just for finding the data partition) which gives best global data partition. This sequence of distributions/redistributions is found by constructing a data structure called Data Distribution Tree (DDT) and finding the least cost path from the source to any of the leaf nodes in the DDT. The costs for the edges come from the communication cost incurred while running a loop with a particular distribution and redistribution to suit the requirement at the next loop. For this a communication cost estimator is developed which works well for fewer dimensions. To handle complete programs we use some heuristic to find the best global distribution for the entire program.Some optimizations like message optimization to reduce the number of messages sent across processors, time optimization which is done by uniform scheduling across processors, and space optimization to keep only the part of array space that any processor owns onto its local memory, are studied. Hyperplane Partitioning is also implemented using an algorithm for synchronization to handle non-local memory access as well as obeying data dependence constraints. The algorithm is also proved to be correct. The target machine is IBM-SP2 using PVM for the message passing library. The performance of the tool on some standard benchmarks (ADI and RHS) and also on some programs designed by us to show the specific merits of the tool. The results show that the loops which have non-uniform dependences also can be run on DMM with good speed-ups.