
dc.contributor.advisor: Simmhan, Yogesh
dc.contributor.author: Prashanthi, S K
dc.date.accessioned: 2025-10-10T05:13:40Z
dc.date.available: 2025-10-10T05:13:40Z
dc.date.submitted: 2025
dc.identifier.uri: https://etd.iisc.ac.in/handle/2005/7173
dc.description.abstract: Deep Neural Networks (DNNs) have had a significant impact on a wide variety of domains, such as Autonomous Vehicles, Smart Cities, and Healthcare, through low-latency inferencing on edge computing devices close to the data source. Recently, there has also been a push towards training DNN models on accelerated edge devices having on-board Graphics Processing Units (GPUs) with 100s-1000s of Compute Unified Device Architecture (CUDA) cores. This is driven by the increasing data collected from edge devices in Cyber-Physical Systems (CPS) and the Internet of Things (IoT), the growing computing power of edge devices, and the rise of on-device training paradigms, such as Federated Learning and Continuous Learning, that focus on privacy and personalization.

Existing literature has primarily focused on optimizing edge inference; there is limited systems research on optimizing DNN training, and on concurrent training and inference, on edge accelerators. Previous work on server GPUs cannot be directly applied to edge devices, which have architectural distinctions from cloud/server GPUs, in particular their 1000s of power modes that configure the Central Processing Unit (CPU) core count and the CPU, GPU, and memory frequencies. Edge devices are also used in varied field deployments that impose power or energy constraints.

In this dissertation, we characterize, model, and predict the behavior of NVIDIA Jetson edge accelerators and their power mode configurations for DNN workloads, using both empirical Machine Learning (ML) based models and analytical roofline-driven models. We leverage these to design system optimizations that tune the edge platform for DNN training and inference workloads, and help DNNs effectively utilize the full potential of accelerated edge hardware.

We first motivate the need for training on the edge, and the associated systems research challenges, through a rigorous empirical performance characterization of four classes of NVIDIA Jetson accelerated edge devices for DNN training. We vary parameters of the PyTorch training framework and the edge device, such as I/O pipelining and parallelism, storage media, mini-batch sizes, and power modes, and examine their effect on CPU and GPU utilization, fetch stalls, training time, energy usage, and variability. Our analysis exposes several resource inter-dependencies and counter-intuitive insights, while also quantifying known wisdom. We also study the impact of containerizing DNN inference and training workloads, contrasting it against bare-metal execution on running time; CPU, GPU, and memory utilization; and energy consumption.

Building upon these insights, we develop PowerTrain, a transfer learning approach to accurately predict the performance and power usage of a new DNN training workload for any given power mode. PowerTrain performs a one-time, costly profiling of 100s of power modes for one DNN model training on a Jetson device to train a reference prediction model, and then generalizes it using transfer learning to different DNN models, datasets, and edge devices with limited custom profiling. We use these predictions to instantly construct a Pareto frontier for the behavior of the new DNN workload and decide the power mode configuration that minimizes training time within a power budget. Our predictions outperform NVIDIA's prediction tool and other baselines, with low prediction errors of 5-15% on time and power.
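As a minimal illustration of that last selection step, the sketch below builds a Pareto frontier from predicted (time, power) pairs and picks the fastest power mode within a power budget. The data structure, mode names, and numbers are illustrative assumptions, not values or code from the thesis; in PowerTrain, the predictions would come from the transfer-learned model described above.

```python
# Hypothetical sketch: picking a power mode from predicted (time, power)
# pairs under a power budget. Numbers and names are illustrative only.
from typing import NamedTuple

class Prediction(NamedTuple):
    power_mode: str      # e.g. a (CPU cores, CPU/GPU/memory frequency) tuple
    time_s: float        # predicted training time per epoch (s)
    power_w: float       # predicted average power draw (W)

def pareto_frontier(preds):
    """Keep modes not dominated in both time and power (lower is better)."""
    frontier = []
    for p in sorted(preds, key=lambda p: (p.power_w, p.time_s)):
        if not frontier or p.time_s < frontier[-1].time_s:
            frontier.append(p)
    return frontier

def best_mode(preds, power_budget_w):
    """Fastest predicted power mode whose power stays within the budget."""
    feasible = [p for p in pareto_frontier(preds) if p.power_w <= power_budget_w]
    return min(feasible, key=lambda p: p.time_s) if feasible else None

preds = [
    Prediction("mode_a", time_s=120.0, power_w=10.0),
    Prediction("mode_b", time_s=90.0, power_w=18.0),
    Prediction("mode_c", time_s=95.0, power_w=25.0),  # dominated by mode_b
]
print(best_mode(preds, power_budget_w=20.0))  # -> mode_b
```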
In Pagoda, we investigate an analytical roofline-based characterization to understand and explain the impact of power modes on various workloads. We develop a time roofline and a novel energy roofline model for diverse power modes, and couple these with an analytical model of the compute (FLOPs) and memory accesses (bytes) of DNN workloads to analyze them from first principles (a toy sketch of this style of reasoning follows the abstract). Lastly, we apply these methods to modify the power mode, and hence the roofline of the edge device, to optimize the latency and energy usage of DNN inference. Our experiments show energy benefits of up to 15% with minimal degradation in time.

Finally, we design Fulcrum, a scheduler that optimizes the power and performance of DNN training and inference workloads, both individually and when run concurrently. Specifically, we develop a managed interleaving approach for concurrent workload execution, scheduled at minibatch granularity, that offers lower variability in inference latency than the native interleaving done by the GPU scheduler. We also propose two novel optimizations that satisfy the diverse Quality of Service (QoS) goals of meeting inference latency and maximizing training throughput while staying within a power budget for field deployments. Our gradient descent-based multi-dimensional search approach (GMD) quickly converges to a solution with less profiling of power modes, while our active learning-based approach (ALS) generalizes well across various problem configurations (a toy search sketch also follows the abstract). Both strategies outperform the baselines and are close to the optimal solution.

Together, these contributions holistically offer a deeper understanding of the performance of DNN workloads on edge accelerators, help accurately model the impact of power modes on their performance and power usage, and provide systems optimizations to effectively leverage edge accelerators for DNN training and inferencing in practical deployments.
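To make the Pagoda-style roofline reasoning concrete, here is a minimal sketch of the classic time roofline plus a naive energy estimate for one power mode. The peak compute, bandwidth, and power figures are made-up placeholders, and the thesis's novel energy roofline model is more detailed than this energy-as-power-times-time approximation.

```python
# Hypothetical roofline sketch for one power mode. All peak numbers are
# placeholders, not measurements from the thesis.
def attainable_flops(peak_flops, peak_bw_bytes, oi):
    """Time roofline: performance is capped by the compute roof or the
    memory roof (bandwidth x operational intensity), whichever is lower."""
    return min(peak_flops, peak_bw_bytes * oi)

def predict_time_energy(total_flops, total_bytes, peak_flops,
                        peak_bw_bytes, avg_power_w):
    """First-principles estimate from a workload's FLOP and byte counts."""
    oi = total_flops / total_bytes          # operational intensity (FLOP/byte)
    perf = attainable_flops(peak_flops, peak_bw_bytes, oi)
    time_s = total_flops / perf
    return time_s, time_s * avg_power_w     # naive energy = avg power x time

# A made-up power mode: 1 TFLOP/s peak compute, 50 GB/s bandwidth, 15 W.
t, e = predict_time_energy(total_flops=2e12, total_bytes=1e11,
                           peak_flops=1e12, peak_bw_bytes=50e9, avg_power_w=15)
print(f"predicted: {t:.2f} s, {e:.1f} J")
```

Changing the power mode moves both roofs, which is how a lower-frequency mode can save energy with little loss in time for a memory-bound workload.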
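The next sketch shows a gradient-descent-style coordinate search over discrete power-mode dimensions, in the spirit of GMD: profile neighbors along each dimension and step toward lower training time while respecting a power budget. The dimension values, the toy profile() model, and the convergence rule are our assumptions, not Fulcrum's actual algorithm.

```python
# Hypothetical GMD-style search over power-mode dimensions (illustrative
# values, not a real Jetson's power-mode table).
import itertools

CPU_CORES = [2, 4, 6, 8]
CPU_FREQ  = [0.7, 1.2, 1.9]   # GHz
GPU_FREQ  = [0.6, 0.9, 1.3]   # GHz
MEM_FREQ  = [0.8, 1.3, 1.6]   # GHz
DIMS = [CPU_CORES, CPU_FREQ, GPU_FREQ, MEM_FREQ]

def profile(mode):
    """Stand-in for running a few minibatches at this power mode and
    measuring (time_s, power_w); the formulas below are a toy model."""
    cores, cf, gf, mf = mode
    time_s = 100.0 / (0.3 * cores * cf + 4.0 * gf + mf)
    power_w = 2.0 * cores * cf + 10.0 * gf + 3.0 * mf
    return time_s, power_w

def gmd_search(power_budget_w):
    """Greedy coordinate descent: repeatedly step +/-1 along any dimension
    that reduces time while keeping power within the budget."""
    idx = [len(d) // 2 for d in DIMS]               # start mid-range
    mode_of = lambda ix: tuple(d[i] for d, i in zip(DIMS, ix))
    t, p = profile(mode_of(idx))
    best_t = t if p <= power_budget_w else float("inf")
    improved = True
    while improved:
        improved = False
        for d, step in itertools.product(range(len(DIMS)), (-1, 1)):
            cand = idx.copy()
            cand[d] += step
            if not 0 <= cand[d] < len(DIMS[d]):
                continue
            t, p = profile(mode_of(cand))
            if p <= power_budget_w and t < best_t:
                idx, best_t, improved = cand, t, True
    return mode_of(idx), best_t

print(gmd_search(power_budget_w=40.0))
```

Because only neighboring modes are profiled at each step, the number of profiled power modes stays far below the thousands available, which is the property the abstract attributes to GMD.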
dc.language.iso: en_US
dc.relation.ispartofseries: ;ET01102
dc.rights: I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
dc.subject: Systems for ML
dc.subject: Power and performance prediction
dc.subject: Edge Accelerators
dc.subject: DNN training and inference
dc.subject: Performance modelling and optimization
dc.subject: Deep Neural Networks
dc.subject: Compute Unified Device Architecture
dc.subject: Machine Learning
dc.subject.classification: Research Subject Categories::TECHNOLOGY::Information technology::Computer science
dc.title: Systems Optimizations for DNN Training and Inference on Accelerated Edge Devices
dc.type: Thesis
dc.degree.name: PhD
dc.degree.level: Doctoral
dc.degree.grantor: Indian Institute of Science
dc.degree.discipline: Engineering

