dc.contributor.advisor | Aditya, Konduri | |
dc.contributor.author | Matthew, Vinod Jacob | |
dc.date.accessioned | 2022-07-18T10:58:43Z | |
dc.date.available | 2022-07-18T10:58:43Z | |
dc.date.submitted | 2022 | |
dc.identifier.uri | https://etd.iisc.ac.in/handle/2005/5785 | |
dc.description.abstract | Partial differential equations (PDEs) are used to model a wide range of natural phenomena and engineered
systems. At conditions of practical interest, these equations are highly non-linear and demand
massive computations. Current state-of-the-art simulations are routinely performed on
supercomputers with hundreds of thousands of processing elements. As the world moves towards
exascale machines, with increasing compute intensity per node and increasing node counts,
communication and synchronization costs pose a major bottleneck to the performance of PDE
solvers. A standard approach to mitigating these bottlenecks is to enhance the overlap between
communication and computation in the algorithmic implementation. An alternative approach
relaxes the communication and synchronization requirements at the level of the underlying
mathematical or numerical method itself. Asynchrony-tolerant (AT) numerical schemes follow this
second approach: they use larger stencils to relax the requirement for boundary-data
communication while maintaining the required order of accuracy.
In the first part of this work, the performance of previously derived finite difference AT schemes
was investigated on GPUs. GPUs are designed to deliver high throughput but suffer from high
latency in data movement; AT schemes can therefore be used to hide this latency and achieve
scalable performance. Two algorithms for applying such AT schemes, the communication-avoiding
algorithm (deterministic) and the synchronization-avoiding algorithm (probabilistic), were
implemented in a solver for turbulence problems based on the compressible Navier-Stokes
equations. The solver was developed for multi-GPU, multi-node systems using the MPI+CUDA
model. The code was profiled and optimised with techniques such as tiling to increase coalesced
memory access and to manage register usage. Scaling studies were then performed, and up to a 50%
improvement in performance over the baseline synchronous implementation was obtained in
benchmarks on up to 1,024 nodes of the OLCF Summit supercomputer.
In the second part of the work, new asynchrony-tolerant schemes for multi-stage Runge-Kutta
methods have been developed, particularly in the context of low storage explicit Runge-Kutta
(LSERK) schemes. The performance of the LSERK-AT schemes was demonstrated using a mini-app
that solves the non-linear Burgers' equation, parallelised with MPI. Benchmarks performed
on SahasraT showed a peak speedup of 6x at an extreme scale of 27,000 cores. | en_US |
dc.language.iso | en_US | en_US |
dc.rights | I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part
of this thesis or dissertation. | en_US |
dc.subject | Asynchrony Tolerant Schemes | en_US |
dc.subject | Partial Differential Equations | en_US |
dc.subject | Scalable Computing | en_US |
dc.subject | Runge-Kutta Methods | en_US |
dc.subject | Low Storage Explicit Runge-Kutta (LSERK) Schemes | en_US |
dc.subject.classification | Computer Science | en_US |
dc.title | Scalable Asynchrony-Tolerant PDE Solver for Multi-GPU Systems | en_US |
dc.type | Thesis | en_US |
dc.degree.name | MTech (Res) | en_US |
dc.degree.level | Masters | en_US |
dc.degree.grantor | Indian Institute of Science | en_US |
dc.degree.discipline | Engineering | en_US |