Efficient Execution Of AMR Computations On GPU Systems
Raghavan, Hari K
Adaptive Mesh Refinement (AMR) is a method that dynamically varies the spatio-temporal resolution of localized mesh regions in numerical simulations, based on the strength of the solution features. Because only localized regions of interest are discretized at high resolution, into rectangular mesh units called patches, AMR provides a high degree of accuracy at low computational cost. General purpose graphics processing units (GPGPUs), with their support for fine-grained parallelism, offer an attractive option for obtaining high performance for AMR applications: the data-parallel computations of the finite difference schemes of AMR can be performed efficiently on GPGPUs. This research addresses the challenges of executing AMR applications with uniform and non-uniform patches on GPUs, and develops techniques for doing so efficiently. In the first part of the thesis, we optimize an AMR model with uniform patches. We have developed strategies for continuous online visualization of time-evolving data for AMR applications executed on GPUs. In-situ visualization plays an important role in analyzing the time-evolving characteristics of the domain structures. Continuous visualization of the output data for various time steps allows better study of the underlying domain and of the model used for simulating it. We reorder the meshes for computation on the GPU based on the user's input about the subdomain to be visualized, which makes the data available for visualization sooner. We then execute the visualization steps and the fix-up operations on the coarse meshes asynchronously on the CPUs while the GPU advances the solution. In experiments on Tesla S1070 and Fermi C2070 clusters, we found that our strategies result in up to 60% improvement in response time and 16% improvement in the rate of visualization of frames over the existing strategy of performing fix-ups and visualization at the end of the time steps.
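The idea of reordering meshes according to the subdomain the user wants to visualize can be illustrated with a minimal sketch. It is not the thesis implementation: the box representation, the overlap-area priority, and the names `overlap_area` and `reorder_patches` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical patch representation: (x0, y0, x1, y1) with half-open extents.
Box = Tuple[int, int, int, int]


def overlap_area(a: Box, b: Box) -> int:
    """Area of the intersection of two axis-aligned boxes (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def reorder_patches(patches: List[Box], roi: Box) -> List[Box]:
    """Order patches so those covering more of the region of interest come first.

    Assumption: overlap area is used as the priority. Python's sort is stable,
    so patches with equal overlap keep their original relative order.
    """
    return sorted(patches, key=lambda p: overlap_area(p, roi), reverse=True)
```

For example, with patches `(0, 0, 4, 4)`, `(4, 0, 8, 4)`, `(4, 4, 8, 8)` and a requested subdomain `(3, 3, 8, 8)`, the patch `(4, 4, 8, 8)` is scheduled first because it lies entirely inside the subdomain.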
The second part of the thesis deals with adaptive strategies for efficient execution of block-structured AMR applications with non-uniform patches on GPUs. Most AMR approaches use patches of uniform size over regions of interest. Since this leads to over-refinement, some efforts have focused on forming patches of non-uniform dimensions, which improves computational efficiency because the dimensions of a patch can be tuned to the geometry of a region of interest. While effective hybrid execution strategies exist for applications with uniform patches, our work considers efficient execution of non-uniform patches with different workloads. Our techniques include a geometric bin-packing method to load-balance GPU computations and reduce thread idling, adaptive determination of the amount of work, using a knapsack formulation, to maximize asynchronism between CPU and GPU executions, and scheduling of communications for multi-GPU executions. We test our strategies on synthetic inputs as well as on traces from real applications. Our experiments on Tesla S1070 and Fermi C2070 clusters with both single-GPU and multi-GPU executions show that our strategies result in up to 69% improvement in performance over existing strategies. Our bin-packing based load balancing gives performance gains of up to 39%, kernel optimizations give an improvement of up to 20%, and our strategies for adaptive asynchronism between CPU-GPU executions give performance improvements of up to 17% over default static asynchronous executions.
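The abstract does not spell out the knapsack formulation, so the following is only a sketch of how such a formulation might look. It assumes integer time estimates, treats each CPU-side task as an item whose weight and value both equal its estimated time, and uses the estimated GPU kernel time as the capacity. The function name `choose_cpu_work` and this cost model are hypothetical.

```python
from typing import List, Tuple


def choose_cpu_work(task_times: List[int], gpu_time: int) -> Tuple[int, List[int]]:
    """Pick CPU tasks to overlap with one GPU step (0/1 knapsack, value == weight).

    Returns the total CPU time selected (never exceeding gpu_time) and the
    indices of the chosen tasks. Times are assumed to be non-negative integers.
    """
    n = len(task_times)
    # dp[i][t]: largest total time achievable with the first i tasks within budget t
    dp = [[0] * (gpu_time + 1) for _ in range(n + 1)]
    for i, cost in enumerate(task_times, start=1):
        for t in range(gpu_time + 1):
            dp[i][t] = dp[i - 1][t]  # option 1: skip task i-1
            if cost <= t:            # option 2: take it if it fits
                dp[i][t] = max(dp[i][t], dp[i - 1][t - cost] + cost)

    # Backtrack through the table to recover which tasks were taken.
    chosen: List[int] = []
    t = gpu_time
    for i in range(n, 0, -1):
        if dp[i][t] != dp[i - 1][t]:
            chosen.append(i - 1)
            t -= task_times[i - 1]
    chosen.reverse()
    return dp[n][gpu_time], chosen
```

With task estimates `[4, 3, 5, 2]` and a GPU step estimated at 10 time units, the sketch selects the tasks at indices 1, 2 and 3, which fill the GPU step exactly and leave the remaining task for a later step. An adaptive scheme would re-estimate the times and re-solve as the workload changes.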