Design and Evaluation of Parallel Coded Systems

Badita, Ajay Kumar

dc.contributor.advisor	Parag, Parimal
dc.contributor.author	Badita, Ajay Kumar
dc.date.accessioned	2021-09-06T07:20:33Z
dc.date.available	2021-09-06T07:20:33Z
dc.date.submitted	2021
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/5269
dc.description.abstract	In this computer era, we all live in a place where the demand for data and computing is increasing day by day. Since the need for faster data retrieval and faster computation brings us a reliability as the solution, we need more more data storage points or computation units. Motivated by scalability, availability, and reliability, there has been a paradigm shift from centralized storage (computation) at a large supercomputer to distributed storage (computing) on a large cluster of regular servers to handle complex tasks. In distributed storage setting, a single file is divided into smaller number of subfiles, which are then stored across multiple nodes, and the file requests are handled by the storage cluster. Similarly, in dis- tributed compute setting, a single task is fragmented into a smaller number of subtasks, and processed by the compute cluster. File request time (task completion time) is limited by the slowest execution time of the parallel subtasks. The lagging subfile requests (subtasks) are referred to as stragglers, and they delay the entire file retrieval (task execution). Straggling servers is one of the challenges in distributed storage and compute systems. Redundancy has emerged as a popular technique to mitigate the impact of stragglers. Redundant subfile requests (compute subtasks) can be sent to a larger set of storage (compute) nodes, such that a smaller subset suffices for the file (task) completion. This approach can be used for straggler mitigation in the face of uncertainty in file retrieval (task execution) times at the storage (compute) nodes. Coding theoretic techniques can be employed to systematically control the redundancy in storage and compute systems. The effective reliable strategies can be either replication coding or maximum distance separable (MDS) coding. Our work introduces new analytical bounds and approximation techniques for the latency-redundancy tradeoff for a range of system loads and a class of symmetric redundancy schemes, under the assumption of Poisson arrivals, exponential service-rates, and fork-join scheduling policy. The proposed approach can be employed to efficiently approximate the latency distribution of a queueing system at equilibrium. We also establish the stability region in terms of arrival rates for redundant systems with certain symmetries, which offers selection guidelines for design parameters to provide latency guarantees based on the proposed approximations. We observe that redundancy can mitigate the impact of stragglers, however, it comes at the cost of additional server nodes working on redundant coded subfiles (subtasks). This cost can be measured by the amount of work done by all server nodes, called aggregate server utilization. In a proactive mitigation approach, we attempt to show that server utilization can be reduced by dynamic coded redundancy, where the number of redundant servers available to a file (task) changes with time. We are interested in dynamic coded redundancy for a single-file (single-compute task) with coded-subfiles (coded-subtasks). In dynamic coded redundancy, additional redundant coded subfiles (subtasks) are spawned on individual server nodes, adaptively over time. The instants at which coded subfiles (subtasks) are spawned are referred to as forking points. We are interested in optimal adaptive strategy such that a linear combination of the mean file retrieval time (mean task completion time) and the mean server utilization is minimized. Specifically, we address the following two interdependent questions: How should we select the forking points? How many coded subfiles (subtasks) should be initiated at each forking point? For this flexible forking strategy with multiple parameters, we analyze the mean of two performance metrics when the random service completion time at each node is independent and distributed identically (i.i.d.) to a shifted exponential. From this study, we find a tradeoff between the metrics which provides insights into the parameter choices. Experiments on Intel DevCloud illustrate that the shifted exponential distribution adequately captures the random coded subtask completion times, and our derived insights continue to hold.	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	;IISc-2021-0073
dc.rights	I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation	en_US
dc.subject	Distributed Systems	en_US
dc.subject	Straggler Mitigation	en_US
dc.subject	Forking Point	en_US
dc.subject	Distributed Storage	en_US
dc.subject	Distributed Computing	en_US
dc.subject	Redundancy	en_US
dc.subject	dynamic coded redundancy	en_US
dc.subject.classification	Research Subject Categories::TECHNOLOGY::Information technology::Computer science	en_US
dc.title	Design and Evaluation of Parallel Coded Systems	en_US
dc.type	Thesis	en_US
dc.degree.name	PhD	en_US
dc.degree.level	Doctoral	en_US
dc.degree.grantor	Indian Institute of Science	en_US
dc.degree.discipline	Engineering	en_US

Files in this item

Name:: Design and Evaluation of Parallel ...
Size:: 2.880Mb
Format:: PDF
Description:: Thesis full text

View/Open

This item appears in the following Collection(s)

Electrical Communication Engineering (ECE) [471]

Show simple item record