EMF: System Design and Challenges for Disaggregated GPUs in Datacenters for Efficiency, Modularity and Flexibility
Abstract
With Dennard Scaling phasing out in the mid-2000s, architectural scaling and hardware specialization
have taken centre stage to provide performance benefits as Moore's law stalls. One
outcome of this hardware specialization is the GPU, which exploits the Data Level Parallelism
in an application. One approach is to augment the existing infrastructure with accelerators
like GPUs that cater to data-parallel, throughput-centric workloads ranging from AI and HPC to
Visualization. Further, the availability of GPUs in Public Cloud offerings has expedited their
mass adoption. At the same time, these modern cloud-based applications place increasing
demands on infrastructure in terms of versatility, performance and efficiency. The high acquisition
and operational costs of GPUs necessitate their optimal utilization while avoiding common
pitfalls like resource stranding.
Disaggregating expensive and power-hungry GPUs will enable a cost-efficient and adaptive
ecosystem for their deployment. In this work, we first quantify the gains associated with disaggregated
GPU deployments using metrics like failed VM requests and GPU Watt-hour consumption.
For this, we use QUADD-SIM, a simulator we built to model, quantify, and contrast
different facets of these emerging GPU deployments. Using QUADD-SIM, we model different
VM and resource provisioning aspects of disaggregated GPU deployments. We simulate realistic
AI workload requests for a period of 3 months with characteristics derived from recent
public datacenter traces. Our results attest that disaggregated GPU deployment strategies
outperform traditional GPU deployments in terms of failed VM requests and GPU Watt-hours
consumption. Through extensive experimentation, we show that disaggregated GPU deployments
service an additional 5.14% and 7.90% of VM requests that would otherwise fail, while consuming
10.92% and 3.30% less GPU Watt-hours than traditional deployments.
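To illustrate the resource-stranding effect behind these gains, the following minimal Python sketch (our own illustration, not part of QUADD-SIM; the GPU counts and the admission policies are assumed purely for the example) contrasts how a traditional, server-bound placement and a disaggregated rack-level pool admit the same VM request:

    def admit_traditional(request_gpus, servers):
        # Traditional deployment: the request must fit within the free GPUs
        # of a single server, even if the rack as a whole has enough capacity.
        for i, free in enumerate(servers):
            if free >= request_gpus:
                servers[i] -= request_gpus
                return True
        return False

    def admit_disaggregated(request_gpus, pool):
        # Disaggregated deployment: any free GPUs in the rack-level pool can
        # be composed to satisfy the request.
        if sum(pool) < request_gpus:
            return False
        remaining = request_gpus
        for i, free in enumerate(pool):
            take = min(free, remaining)
            pool[i] -= take
            remaining -= take
            if remaining == 0:
                break
        return True

    # Example: four servers with fragmented free GPUs; an 8-GPU request is
    # stranded under traditional placement but succeeds once GPUs are pooled.
    servers = [3, 3, 1, 1]
    print(admit_traditional(8, list(servers)))    # False
    print(admit_disaggregated(8, list(servers)))  # True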
As our second contribution, we identify how the disaggregation constructs could be met
at different abstraction levels of the NVIDIA GPU computing stack. We introduce the notion of
a Disaggregation Plane to understand the feasibility and limitations of a disaggregated solution. We then evaluate various GPU disaggregation solution approaches against the Disaggregation Plane
using the following metrics: 1) Composability, 2) Independent Existence, and 3) Backward Compatibility.
Based on this analysis, we propose EMF: a rack-level, open system for GPU
Disaggregation. EMF, in addition to supporting the core disaggregation constructs (i.e., independent
existence and composability), also provides backward compatibility. While presenting the design
of EMF, we highlight key design abstractions and elements, along with some pressing issues in
realizing the system. Lastly, we evaluate the performance impact by quantifying worst-case latency
overheads due to disaggregation. We model host device driver and GPU interactions for data-transfer
operations over PCIe in terms of Transaction Layer Packets (TLPs) to understand the performance
impact of our design. Further evaluation with 6 Deep Learning applications shows that these overheads
could vary from 7.6% to 20.2%, demonstrating the practicality of our design. We find that the
latency overheads are directly correlated with the Average Throughput of an application, and that
applications with short lifetimes and bursty data-transfer characteristics may show visible
performance degradation.
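As an illustration of the kind of worst-case accounting involved, the following Python sketch (our own simplification; the maximum payload size and the per-TLP fabric latency are assumed values for the example, not measurements from our system) estimates the added latency when every TLP of a host-to-GPU transfer pays one extra fabric hop:

    MAX_PAYLOAD_BYTES = 256          # assumed PCIe maximum payload size per TLP
    EXTRA_LATENCY_PER_TLP_NS = 100   # assumed added fabric latency per TLP

    def tlp_count(transfer_bytes, max_payload=MAX_PAYLOAD_BYTES):
        # Number of TLPs needed to carry the payload of one transfer.
        return -(-transfer_bytes // max_payload)  # ceiling division

    def worst_case_overhead_ns(transfer_bytes):
        # Worst case: every TLP of the transfer incurs the extra fabric latency.
        return tlp_count(transfer_bytes) * EXTRA_LATENCY_PER_TLP_NS

    # Example: worst-case added latency for a 1 MiB host-to-GPU copy.
    print(worst_case_overhead_ns(1 << 20))  # 4096 TLPs * 100 ns = 409600 ns

In such a model the added latency grows with the volume of data moved per unit time, which is consistent with the correlation between latency overhead and application Average Throughput noted above.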