dc.description.abstract | Deep Neural Networks (DNNs) have had a significant impact on a wide variety of domains, such as Autonomous Vehicles, Smart Cities, and Healthcare, through low-latency inferencing on edge computing devices close to the data source. Recently, there has also been a push towards training DNN models on accelerated edge devices with on-board Graphics Processing Units (GPUs) having hundreds to thousands of Compute Unified Device Architecture (CUDA) cores. This is driven by the growing volume of data collected by edge devices in Cyber-Physical Systems (CPS) and the Internet of Things (IoT), the increasing computing power of edge devices, and the rise of on-device training paradigms such as Federated Learning and Continuous Learning that emphasize privacy and personalization.
Existing literature has primarily focused on optimizing edge inference. There is limited systems research on optimizing DNN training, or concurrent training and inference, on edge accelerators. Prior work on server GPUs cannot be directly applied to edge devices since edge accelerators are architecturally distinct from cloud/server GPUs, most notably in their thousands of power modes, which combine the Central Processing Unit (CPU) core count with CPU, GPU and memory frequencies. They are also used in varied field deployments that impose power or energy constraints. In this dissertation, we characterize, model and predict the behavior of NVIDIA Jetson edge accelerators and their power mode configurations for DNN workloads, using both empirical Machine Learning (ML) based models and analytical roofline-driven models. We leverage these to design system optimizations that tune the edge platform for DNN training and inference workloads, and help DNNs effectively utilize the full potential of accelerated edge hardware.
We first motivate the need for training on the edge and the associated systems research challenges through a rigorous empirical performance characterization of four classes of NVIDIA Jetson accelerated edge devices for DNN training. We vary parameters of the PyTorch training framework and the edge device, such as I/O pipelining and parallelism, storage media, mini-batch sizes and power modes, and examine their effect on CPU and GPU utilization, fetch stalls, training time, energy usage, and variability. Our analysis exposes several resource inter-dependencies and counter-intuitive insights, while also helping quantify known wisdom. We also study the impact of containerizing DNN inference and training workloads, contrasting it with bare-metal execution in terms of running time; CPU, GPU and memory utilization; and energy consumption.
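As a minimal sketch of the training-side knobs studied in this characterization (the dataset path and specific values below are illustrative, not those used in the experiments), the following PyTorch fragment shows how I/O parallelism, pinned-memory pipelining and mini-batch size are exposed to the framework, while the Jetson power mode is configured outside PyTorch:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Illustrative values only: these are the kinds of knobs varied in the study.
NUM_WORKERS = 4   # I/O parallelism: worker processes fetching and pre-processing data
BATCH_SIZE = 32   # mini-batch size
PIN_MEMORY = True # pinned host memory enables asynchronous host-to-GPU copies (pipelining)

train_data = datasets.ImageFolder(
    "/path/to/dataset",  # the storage medium backing this path (SD card, eMMC, NVMe) also matters
    transform=transforms.Compose([transforms.Resize((224, 224)),
                                  transforms.ToTensor()]))

loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True,
                    num_workers=NUM_WORKERS, pin_memory=PIN_MEMORY,
                    prefetch_factor=2, persistent_workers=True)

device = torch.device("cuda")  # on-board GPU of the Jetson
# The power mode (CPU core count, CPU/GPU/memory frequencies) is set externally,
# e.g., `sudo nvpmodel -m <mode_id>` or per-frequency tuning via jetson_clocks.
```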
Building upon these insights, we develop PowerTrain, a transfer-learning approach to accurately predict the performance and power usage of a new DNN training workload for any given power mode. PowerTrain performs a one-time, costly profiling of hundreds of power modes for one DNN model training on a Jetson device to build a reference prediction model, and then generalizes it using transfer learning to different DNN models, datasets and edge devices with limited additional profiling. We use these predictions to instantly construct a Pareto frontier of the new DNN workload's behavior and select the power mode configuration that minimizes the training time within a power budget. Our predictions outperform the NVIDIA prediction tool and other baselines, with low prediction errors of 5-15% on time and power.
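As an illustrative sketch (not PowerTrain's actual implementation), once the transferred models yield per-power-mode predictions of training time and power, choosing a configuration under a power budget reduces to filtering the predictions and taking the time minimizer; the function and variable names here are hypothetical:

```python
from typing import Dict, Optional, Tuple

PowerMode = Tuple[int, int, int, int]  # (cpu_cores, cpu_freq, gpu_freq, mem_freq)

def pick_power_mode(predictions: Dict[PowerMode, Tuple[float, float]],
                    power_budget_w: float) -> Optional[PowerMode]:
    """predictions maps a power mode -> (predicted_time_s, predicted_power_w).
    Returns the mode with the smallest predicted training time whose predicted
    power stays within the budget, or None if no mode is feasible."""
    feasible = {pm: (t, p) for pm, (t, p) in predictions.items() if p <= power_budget_w}
    if not feasible:
        return None
    return min(feasible, key=lambda pm: feasible[pm][0])
```

Sweeping the budget over a range of values traces out the predicted time-power Pareto frontier for the new workload.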
In Pagoda, we investigate analytical roofline-based characterization to understand and explain the impact of power modes on various workloads. We develop a time roofline and a novel energy roofline model across diverse power modes, and couple these with an analytical model of the compute (FLOPs) and memory accesses (bytes) of DNN workloads to analyze them from first principles. Lastly, we apply these methods to modify the power mode, and hence the roofline of the edge device, to optimize the latency and energy usage of DNN inference. Our experiments show energy benefits of up to 15% with minimal degradation in time.
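For context, the classical time roofline bounds attainable throughput by compute and bandwidth ceilings that shift with the power mode; a standard formulation is sketched below (the energy form shown is one common analytical variant, not necessarily the exact model developed in Pagoda), where $W$ is the workload's floating point operations, $Q$ its memory traffic in bytes, $I = W/Q$ its arithmetic intensity, and $P_{\text{peak}}$, $B$ the peak compute rate and memory bandwidth at a given power mode:

```latex
% Time roofline: attainable performance and the resulting execution-time bound
\mathrm{Perf}(I) = \min\bigl(P_{\text{peak}},\; B \cdot I\bigr), \qquad
T \;\approx\; \max\!\left(\frac{W}{P_{\text{peak}}},\; \frac{Q}{B}\right)

% One common energy-roofline form (illustrative): per-FLOP and per-byte energies
% plus idle power integrated over the execution time
E \;\approx\; W\,\epsilon_{\text{flop}} \;+\; Q\,\epsilon_{\text{byte}} \;+\; P_{\text{idle}}\,T
```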
Finally, we design Fulcrum, a scheduler that optimizes the power and performance of DNN training and inference workloads, both individually and when run concurrently. Specifically, we develop a managed interleaving approach for concurrent workload execution scheduled at the minibatch granularity, which offers lower variability in inference latency than the native interleaving performed by the GPU scheduler. We also propose two novel optimizations that satisfy the diverse Quality of Service (QoS) goals of meeting inference latency and maximizing training throughput while staying within a power budget for field deployments. Our gradient descent-based multi-dimensional search approach (GMD) quickly converges to a solution with limited profiling of power modes, whereas our active-learning-based approach (ALS) generalizes well across various problem configurations. Both strategies outperform the baselines and come close to the optimal solution.
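The following is a minimal, greedy coordinate-style sketch in the spirit of such a multi-dimensional search over the discrete power-mode axes; the actual GMD algorithm, its step rule and objective differ, and `profile`, `latency_slo_ms` and the axis names are hypothetical:

```python
def gmd_style_search(axes, profile, power_budget_w, latency_slo_ms, max_iters=20):
    """axes: dict mapping a power-mode dimension to its sorted allowed values, e.g.
         {"cpu_cores": [2, 4, 6, 8], "cpu_freq": [...], "gpu_freq": [...], "mem_freq": [...]}.
    profile(mode): measures one power mode, returning
         (train_throughput, infer_latency_ms, power_w).
    Greedily steps along whichever axis improves training throughput while the
    measured power and inference latency stay within their budgets."""
    mode = {dim: vals[len(vals) // 2] for dim, vals in axes.items()}  # start mid-range
    best_tp, _, _ = profile(mode)
    for _ in range(max_iters):
        improved = False
        for dim, vals in axes.items():
            i = vals.index(mode[dim])
            for j in (i - 1, i + 1):  # try the neighbouring values along this axis
                if 0 <= j < len(vals):
                    cand = dict(mode, **{dim: vals[j]})
                    tp, lat, pw = profile(cand)
                    if pw <= power_budget_w and lat <= latency_slo_ms and tp > best_tp:
                        mode, best_tp, improved = cand, tp, True
        if not improved:  # no single-axis step helps: local optimum under the constraints
            break
    return mode, best_tp
```

Because each iteration only profiles the immediate neighbours of the current mode, the number of power modes actually measured stays small relative to the full configuration space.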
Together, these contributions holistically offer a deeper understanding of the performance of DNN workloads on edge accelerators, help accurately model the impact of power modes on their performance and power usage, and provide systems optimizations to effectively leverage edge accelerators for DNN training and inferencing in practical situations. | en_US |