    etd@IISc › Division of Electrical, Electronics, and Computer Science (EECS) › Computer Science and Automation (CSA)

    Scaling Up GPU Memory Management

    Thesis full text (3.136Mb)
    Author
    Pratheek, B
    Abstract
    The volume of data generated worldwide is growing at an unprecedented rate, and GPUs have emerged as the primary compute engine for processing this data. While GPUs offer massive compute power (thousands of TFLOPS), they are constrained by relatively low memory bandwidth (around 1 TBps) and limited on-board memory capacity (tens of GBs). As a result, memory is often the primary performance bottleneck in GPU applications. This thesis explores two key aspects of scaling up GPU memory management to tackle the memory bottleneck: 1) improving address translation to reduce memory access latency, and 2) improving GPU memory oversubscription to enable GPU programs to work effectively with large datasets.

    In the first part of this thesis, we studied the impact of the non-uniformity of Multi-Chip-Module (MCM) design on address translation in GPUs. With the end of Moore's law, GPU manufacturers are moving towards MCM designs to scale GPUs beyond the limits of monolithic chips, integrating multiple smaller 'chiplets' with high-bandwidth interconnects to create larger GPUs. Unfortunately, the disaggregated nature of MCM design leads to non-uniform resource access latencies. In the first work, we analyzed how MCM design impacts address translation, a critical-path operation that affects memory access latency. We observed that the disaggregation of address translation mechanisms (i.e., TLBs and page walkers) in MCM GPUs leads to a severe increase in address translation latencies (and hence memory access latencies), degrading performance significantly. We proposed MCM-aware GPU Virtual Memory (MGVM), which leverages the access patterns of GPU applications to limit remote TLB and page table accesses, ultimately improving performance by 52% across diverse applications.

    The second part of this thesis comprises two works that tackle challenges posed by GPU memory oversubscription under NVIDIA's Unified Virtual Memory (UVM) technology. UVM enables oversubscription of the GPU's limited on-board memory capacity by using CPU memory as secondary storage and performing programmer-transparent page migrations and evictions, thereby allowing GPU applications to scale easily to large datasets. Unfortunately, applications under GPU memory oversubscription often experience significant slowdowns due to the latency of page migrations and GPU memory thrashing. In the second work, SUV, we observed that much of the slowdown of memory oversubscription can be eliminated with appropriate data placement and prefetching strategies, informed by the memory access patterns exhibited by GPU programs. SUV uses static analysis to extract the memory access patterns of GPU programs, and combines them with runtime information (e.g., sizes of memory allocations) to guide data placement, migration, and prefetching between CPU memory and GPU memory. SUV improves the performance of a variety of applications by 77% over UVM. The third work, SuperUVM, focuses on improving UVM's page eviction and prefetching policies by enabling 'observability' into the GPU's accesses to pages resident in GPU memory, something current GPUs do not provide. We show that UVM's current eviction and prefetching policies, handled by the UVM driver running on the CPU, are limited in their ability to make informed decisions due to this lack of observability. SuperUVM enables observability into the GPU's memory accesses by repurposing existing hardware access counters, enabling better-informed eviction and prefetching policies. SuperUVM improves UVM performance by around 33% across 10 applications.
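    As a rough illustration of why observability into on-GPU memory accesses matters for eviction (the core idea the abstract attributes to SuperUVM), the following toy Python simulation contrasts two eviction policies under oversubscription: one that evicts in migration order without any access information, and one that consults per-page access counts (a stand-in for hardware access counters). The trace, capacity, and both policies are illustrative assumptions for this sketch only, not the thesis's actual mechanisms or the real UVM driver's policies.

    ```python
    from collections import Counter

    def simulate(trace, capacity, use_counters):
        """Simulate page migration into a GPU memory holding `capacity` pages."""
        resident = []       # pages currently in GPU memory, in migration order
        counts = Counter()  # per-page access counts (stand-in for access counters)
        faults = 0          # page faults, i.e., CPU-to-GPU migrations
        for page in trace:
            counts[page] += 1
            if page in resident:
                continue    # access to a resident page: no fault
            faults += 1
            if len(resident) >= capacity:
                if use_counters:
                    # observability: evict the least-accessed resident page
                    victim = min(resident, key=lambda p: counts[p])
                else:
                    # no observability: evict in migration (FIFO) order
                    victim = resident[0]
                resident.remove(victim)
            resident.append(page)
        return faults

    # Trace: one hot page (0) interleaved with a streaming scan over pages 1..7,
    # on a GPU memory that holds only 4 pages (i.e., oversubscribed).
    trace = []
    for i in range(1, 100):
        trace += [0, i % 7 + 1]

    blind = simulate(trace, capacity=4, use_counters=False)
    informed = simulate(trace, capacity=4, use_counters=True)
    print(f"faults without observability: {blind}, with access counters: {informed}")
    ```

    With access counts visible, the hot page is never chosen as a victim, so the counter-informed policy triggers fewer migrations than the blind one on this trace.
    
    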
    URI
    https://etd.iisc.ac.in/handle/2005/9615
    Collections
    • Computer Science and Automation (CSA) [561]

    etd@IISc is a joint service of SERC & J R D Tata Memorial (JRDTML) Library || Powered by DSpace software || DuraSpace
    Contact Us | Send Feedback | Thesis Templates
    Theme by Atmire NV