EMF: System Design and Challenges for Disaggregated GPUs in Datacenters for Efficiency, Modularity, and Flexibility
With Dennard scaling phasing out in the mid-2000s and Moore's law already stalling, architectural scaling and hardware specialization have taken center stage in delivering performance benefits. One outcome of this hardware specialization is the GPU, which exploits the Data-Level Parallelism in an application. One approach is to augment the existing infrastructure with accelerators like GPUs that cater to data-parallel, throughput-centric workloads ranging from AI and HPC to Visualization. Further, the availability of GPUs in Public Cloud offerings has expedited their mass adoption. At the same time, these modern cloud-based applications are placing increasing demands on infrastructure in terms of versatility, performance, and efficiency. The high acquisition and operational costs of GPUs necessitate their optimal utilization while avoiding common pitfalls like resource stranding. Disaggregating expensive and power-hungry GPUs will enable a cost-efficient and adaptive ecosystem for their deployments. In this work, we first quantify the gains associated with disaggregated GPU deployments using metrics like failed VM requests and GPU Watt-hours consumed. For this, we use QUADD-SIM, a simulator we built to model, quantify, and contrast different facets of these emerging GPU deployments. Using QUADD-SIM, we model different VM and resource provisioning aspects of disaggregated GPU deployments. We simulate realistic AI workload requests for a period of 3 months with characteristics derived from recent public datacenter traces. Our results attest that disaggregated GPU deployment strategies outperform traditional GPU deployments in terms of failed VM requests and GPU Watt-hours consumed. We show through extensive experimentation that disaggregated GPU deployments serviced 5.14% and 7.90% additional VM requests that would otherwise have failed, while consuming 10.92% and 3.30% less GPU Watt-hours compared to traditional deployments.
As our second contribution, we identify how the disaggregation constructs could be met at different abstraction levels of the NVIDIA GPU computing stack. We introduce the notion of the Disaggregation Plane to understand the feasibility and limitations of a disaggregated solution. We then evaluate various GPU disaggregation solution approaches against the Disaggregation Plane using the following metrics: 1) Composability, 2) Independent Existence, and 3) Backward Compatibility. Based on this analysis, we propose EMF: a rack-level, open system for GPU disaggregation. EMF, in addition to supporting the core disaggregation constructs (i.e., independent existence and composability), also provides backward compatibility. While presenting the design of EMF, we highlight key design abstractions and elements, along with pressing issues in realizing the system. Lastly, we evaluate the performance impact by quantifying worst-case latency overheads due to disaggregation. We model host device driver and GPU interactions for data transfer operations over PCIe in terms of TLPs to understand the performance impact of our design. Further evaluation with 6 Deep Learning applications shows that these overheads range from 7.6% to 20.2%, demonstrating the practicality of our design. We find that the latency overheads are directly correlated with the average throughput of the application, and that short-lived applications with bursty data-transfer characteristics may show visible performance degradation.