Scalable Asynchrony-Tolerant PDE Solver for Multi-GPU Systems
Abstract
Partial differential equations (PDEs) are used to model various natural phenomena and engineered
systems. At conditions of practical interest, these equations are highly non-linear and demand massive computational resources. Current state-of-the-art simulations are routinely performed on supercomputers with hundreds of thousands of processing elements. As compute intensity per node and node counts grow on the path to exascale machines, communication and synchronization costs become a major bottleneck for the performance of PDE solvers. A standard approach to mitigating these bottlenecks is to increase the overlap between communication and computation in the algorithmic implementation. An alternative approach relaxes the communication and synchronization requirements themselves, at the level of the underlying mathematical or numerical method. Asynchrony-tolerant (AT) numerical schemes follow this second approach: they use larger stencils to relax the need for up-to-date boundary data while maintaining the required order of accuracy.
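As a minimal illustration of the idea (a sketch under simplified assumptions, not one of the specific schemes used in this work), consider the standard second-order central difference at a grid point $i$ next to a processor boundary,
\[
  \left.\frac{\partial^2 u}{\partial x^2}\right|_i^{n} \approx \frac{u_{i+1}^{n} - 2u_i^{n} + u_{i-1}^{n}}{\Delta x^2}.
\]
If the neighbouring value arrives with a delay of $\tilde{k}$ time steps, simply substituting $u_{i+1}^{n-\tilde{k}}$ introduces an error of order $\tilde{k}\,\Delta t/\Delta x^2$ and degrades the accuracy of the scheme. One simple asynchrony-tolerant remedy extrapolates in time from two delayed levels that are already available locally,
\[
  u_{i+1}^{n} \approx (\tilde{k}+1)\,u_{i+1}^{n-\tilde{k}} - \tilde{k}\,u_{i+1}^{n-\tilde{k}-1},
\]
which cancels the leading delay error; the general AT schemes use wider stencils in space and/or time to recover any prescribed order of accuracy.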
In the first part of this work, the performance of previously derived finite-difference AT schemes was investigated on GPUs. GPUs are designed to deliver high throughput but suffer from high latency for data movement; AT schemes can therefore be used to hide this latency and achieve scalable performance. Two algorithms for applying such AT schemes, the deterministic communication-avoiding algorithm and the probabilistic synchronization-avoiding algorithm, have been implemented in a solver for turbulence problems based on the compressible Navier-Stokes equations. The solver was developed for multi-GPU, multi-node systems using the MPI+CUDA
model. The code was profiled and optimised with techniques such as tiling to increase coalesced
memory access and manage register usage. Scaling studies were then performed, and up to 50% improvement in performance was obtained over the baseline synchronous implementation in benchmarks running on up to 1,024 nodes of the OLCF Summit supercomputer.
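The following skeleton is a purely illustrative sketch, in C with MPI, of the communication-avoiding pattern on a 1-D diffusion problem; the problem itself and the parameters N, STEPS and K are hypothetical, and the actual solver uses CUDA kernels and the AT schemes referenced above. Halos are exchanged only every K steps, so boundary points are updated with data whose age is known and deterministic, which is exactly where an AT stencil would apply its correction.

    /* Illustrative skeleton only (not the thesis code): 1-D diffusion with a
     * communication-avoiding halo exchange.  Halos are exchanged every K
     * steps; in between, boundary points are updated from stale halo data. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define N     64      /* interior points per rank (hypothetical size) */
    #define STEPS 100     /* number of time steps                         */
    #define K     4       /* halo-exchange period, so delay <= K-1 steps  */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int left  = (rank - 1 + size) % size;   /* periodic rank layout for the sketch */
        int right = (rank + 1) % size;

        double u[N + 2], un[N + 2];             /* one halo cell per side */
        for (int i = 0; i < N + 2; ++i) u[i] = (double)rank;

        const double nu = 0.1, dx = 1.0, dt = 0.4 * dx * dx / nu;
        const double c  = nu * dt / (dx * dx);

        for (int n = 0; n < STEPS; ++n) {
            if (n % K == 0) {
                /* The only communication: a halo exchange every K steps. */
                MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left,  0,
                             &u[N + 1], 1, MPI_DOUBLE, right, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 1,
                             &u[0], 1, MPI_DOUBLE, left,  1,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
            int delay = n % K;   /* deterministic age of the halo data */
            (void)delay;         /* would select the AT weights in a real scheme */

            /* Interior points never touch the halos (a CUDA kernel in the
             * multi-GPU solver). */
            for (int i = 2; i <= N - 1; ++i)
                un[i] = u[i] + c * (u[i + 1] - 2.0 * u[i] + u[i - 1]);

            /* Boundary points use the delayed halos; a genuine AT scheme would
             * combine halo values from several past exchanges, weighted by
             * 'delay', to keep the formal order of accuracy. */
            un[1] = u[1] + c * (u[2] - 2.0 * u[1] + u[0]);
            un[N] = u[N] + c * (u[N + 1] - 2.0 * u[N] + u[N - 1]);

            memcpy(&u[1], &un[1], N * sizeof(double));
        }

        if (rank == 0) printf("completed %d steps on %d ranks\n", STEPS, size);
        MPI_Finalize();
        return 0;
    }

The synchronization-avoiding variant instead posts non-blocking exchanges every step and proceeds without waiting for them, so the delay observed at each boundary becomes a random variable rather than a fixed quantity.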
In the second part of the work, new asynchrony-tolerant schemes for multi-stage Runge-Kutta methods have been developed, particularly in the context of low-storage explicit Runge-Kutta (LSERK) schemes. The performance of the LSERK-AT schemes was demonstrated using a mini-app that solves the non-linear Burgers’ equations and is parallelised with MPI. Benchmarks performed
on SahasraT showed a peak speedup of 6x at an extreme scale of 27,000 cores.
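For context (a standard formulation, not a result of this work), low-storage explicit Runge-Kutta methods are commonly written in Williamson's two-register form, in which an S-stage step for $du/dt = f(u)$ reads
\[
  k^{(s)} = A_s\,k^{(s-1)} + \Delta t\,f\!\left(u^{(s-1)}\right), \qquad
  u^{(s)} = u^{(s-1)} + B_s\,k^{(s)}, \qquad s = 1,\dots,S,
\]
with $A_1 = 0$ and $u^{(0)} = u^n$, so that only the two registers $u$ and $k$ need to be stored regardless of the number of stages. The LSERK-AT schemes relax, in the same asynchrony-tolerant spirit as above, the boundary-data communication on which these stage evaluations depend, while retaining the low-storage structure.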