A Study of the Performance Benefits of Controlling Parallel Asynochrous Iteractive Applications
Abstract
High performance networks of workstation are becoming increasingly popular a parallel computing platform because of their lower cost. Both message passing and software distributed shared memory (DSM) programming paradigms have been developed and employed on such distributed hardware platforms. An important performance bottleneck in these systems is the effective data transmission latency, which is poorer than in high-speed parallel computer interconnection networks.
Iterative algorithms are used in a large class of applications like solution of partial algorithms are used, optimization problems, solutions to systems of linear equations, and so on. These can be parallelized in a straight-forward fashion with cad1 node computing a part of the data set and also synchronizing and exchanging data with the other nodes as required. But these synchronous version perform poorly when message transmission delays are high, as is the case in network of workstations. The asynchronous parallel version of these algorithms provide an additional degree of freedom to address large data transmission latencies. These algorithms do not synchronize, and behave correctly in the presence of losses and delays in the propagation of updates. Thus, in shared memory systems they do not synchronize accesses to shared data and they will work correctly even in the presence of delays and losses in updates. This gives synchronous algorithms a performance advantage over their synchronous counterparts since synchronization costs are avoided and further computation can be overlapped with communication.
The message generation rate of asynchronous algorithms is however greater than that of their synchronous counterparts. Uncontrolled asynchronous runs can create a large network load resulting in large queuing delays, which in turn can increase the message generation of the asynchronous algorithms. This is especially possible in lower bandwidth network like that in network of workstations. Such a positive feedback loop leads to unstable network conditions.
Recent work has tried to improve the performance of asynchronous algorithms on a distributed shared memory (DSM) system by adaptively buffering shared memory updates depending on the network load, and transmitting multiple updates together. This reduces congestion and message transmission overhead, but could still result in slow convergence since nothing is guaranteed about, the update propagation delay. Also, although adaptive throttling of message will kick in once the network gets heavily loaded, it cannot, prevent the initial flooding. Furthermore, the processor is not freed when computation with the available values does not result in much further convergence.
In this thesis we present an alternate method of controlling iterative methods and present performance results for the same. We propose a new system-supported blocking read primitive, termed Global Read that is guaranteed to return a value of acceptable age of the specified location in a DSM system. The main idea is to enforce an upper bound on the age of shared updates seen by a node in a manner visible to the underlying system (DSM). Information about processes being blocked can be used for adapting the underlying system, especially the network, towards better performance. A reading process is throttled until its Global-Read is satisfied, thus implementing program-level flow control and also freeing the processor. The Global-Read can also help in communication-based scheduling of, processes.
Performance evaluation using a benchmark from Cray, on a network of workstations and on the IBM SP2 parallel computer, showed good performance improvements. We also present results of a systematic study wherein we implement parallel code for different iterative techniques for the solution of Lap lace equation wing PVM, anti characterize when controlled asynchrony work befit. We studied the improvements in computation time and analyzed the sources of this improvement, using various input and parallelism, on both IBM SP2 and a network of workstations. We find significant performance improvements for controlling asynchrony when the traffic generated by the asynchronous algorithm becomes more than what can be sustained by the network. But we found that the scalability of the applications is limited by the software overhead for messages. Systems which have reduced software overhead will show very good scalable performance for controlled asynchrony.