Design and Evaluation of Parallel Coded Systems
Abstract
In the current computing era, the demand for data and computation grows day by day. Meeting the need for faster data retrieval and faster computation requires more data storage points and computation units. Motivated by scalability, availability, and reliability, there has been a paradigm shift from centralized storage (computation) on a large supercomputer to distributed storage (computing) on a large cluster of commodity servers that handle complex tasks. In the distributed storage setting, a single file is divided into a number of smaller subfiles, which are stored across multiple nodes, and file requests are served by the storage cluster. Similarly, in the distributed computing setting, a single task is fragmented into a number of smaller subtasks, which are processed by the compute cluster. The file request time (task completion time) is limited by the slowest of the parallel subtask executions. The lagging subfile requests (subtasks) are referred to as stragglers, and they delay the entire file retrieval (task execution). Straggling servers are one of the key challenges in distributed storage and compute systems.
Redundancy has emerged as a popular technique to mitigate the impact of stragglers. Redundant subfile requests (compute subtasks) can be sent to a larger set of storage (compute) nodes, such that completion by a smaller subset suffices for the file retrieval (task execution). This approach mitigates stragglers in the face of uncertainty in file retrieval (task execution) times at the storage (compute) nodes. Coding-theoretic techniques can be employed to systematically control the redundancy in storage and compute systems; two effective strategies are replication coding and maximum distance separable (MDS) coding. Our work introduces new analytical bounds and approximation techniques for the latency-redundancy tradeoff over a range of system loads and a class of symmetric redundancy schemes, under the assumption of Poisson arrivals, exponentially distributed service times, and a fork-join scheduling policy. The proposed approach can be employed to efficiently approximate the latency distribution of a queueing system at equilibrium. We also establish the stability region, in terms of arrival rates, for redundant systems with certain symmetries, which offers guidelines for selecting design parameters to provide latency guarantees based on the proposed approximations.
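The latency benefit of MDS-coded redundancy in a fork-join system can be illustrated with a small Monte Carlo sketch. This toy model is an assumption for illustration, not the dissertation's analysis: with an (n, k) MDS code, a job completes once any k of its n coded subtasks finish, and for i.i.d. Exp(mu) subtask service times the mean completion time has the standard order-statistic closed form (1/mu)(H_n - H_{n-k}).

```python
import random

def mean_kth_of_n_exponential(n, k, mu=1.0):
    # Expected k-th order statistic of n i.i.d. Exp(mu) variables:
    # (1/mu) * (H_n - H_{n-k}), where H_m is the m-th harmonic number.
    harmonic = lambda m: sum(1.0 / i for i in range(1, m + 1))
    return (harmonic(n) - harmonic(n - k)) / mu

def simulate_fork_join_latency(n, k, mu=1.0, trials=20000, seed=0):
    # Fork one job into n coded subtasks; the job completes when any k finish.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        finishes = sorted(rng.expovariate(mu) for _ in range(n))
        total += finishes[k - 1]  # the k-th fastest subtask completes the job
    return total / trials

# More redundancy (larger n at fixed k) lowers the mean completion time:
# an (8, 4) MDS code beats splitting the job into 4 subtasks with no redundancy.
assert mean_kth_of_n_exponential(8, 4) < mean_kth_of_n_exponential(4, 4)
```

Note that this sketch captures only the service (order-statistic) component of latency; the queueing model in our work additionally accounts for waiting times induced by Poisson arrivals across system loads.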
We observe that redundancy can mitigate the impact of stragglers; however, it comes at the cost of additional server nodes working on redundant coded subfiles (subtasks). This cost can be measured by the amount of work done by all server nodes, called the aggregate server utilization. In a proactive mitigation approach, we show that server utilization can be reduced by dynamic coded redundancy, where the number of redundant servers available to a file (task) changes with time. We study dynamic coded redundancy for a single file (single compute task) with coded subfiles (subtasks), in which additional redundant coded subfiles (subtasks) are spawned on individual server nodes adaptively over time. The instants at which coded subfiles (subtasks) are spawned are referred to as forking points. We seek an optimal adaptive strategy that minimizes a linear combination of the mean file retrieval time (mean task completion time) and the mean server utilization. Specifically, we address two interdependent questions: How should the forking points be selected? How many coded subfiles (subtasks) should be initiated at each forking point? For this flexible forking strategy with multiple parameters, we analyze the means of the two performance metrics when the random service completion time at each node is independent and identically distributed (i.i.d.) as a shifted exponential. This study reveals a tradeoff between the two metrics, which provides insights into the parameter choices. Experiments on Intel DevCloud illustrate that the shifted exponential distribution adequately captures the random coded subtask completion times and that our derived insights continue to hold.
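The latency-utilization tradeoff behind these questions can likewise be sketched with a toy single-forking-point simulation. The policy, parameter values, and metric accounting below are illustrative assumptions, not the dissertation's model: n0 servers begin coded subtasks at time zero; if the job (needing k of the launched subtasks, with k at most n0) is still running at a forking point t_fork, n_extra additional servers are spawned; per-node service is shifted exponential (delta + Exp(mu)); and every launched server works from its start until the job completes, when remaining subtasks are cancelled.

```python
import random

def simulate_single_fork(n0, n_extra, t_fork, k, delta=0.5, mu=1.0,
                         trials=5000, seed=1):
    # One job needing k completed coded subtasks, with k <= n0.
    # Returns (mean completion time, mean aggregate server utilization).
    rng = random.Random(seed)
    sum_latency = sum_util = 0.0
    for _ in range(trials):
        starts = [0.0] * n0
        finishes = [delta + rng.expovariate(mu) for _ in range(n0)]
        done = sorted(finishes)
        if done[k - 1] > t_fork:  # job still running at the forking point
            starts = starts + [t_fork] * n_extra
            finishes = finishes + [t_fork + delta + rng.expovariate(mu)
                                   for _ in range(n_extra)]
            done = sorted(finishes)
        t_done = done[k - 1]  # k-th completed coded subtask finishes the job
        # Aggregate utilization: each launched server works from its start
        # until it finishes or the job completes, whichever comes first.
        sum_util += sum(min(f, t_done) - s
                        for s, f in zip(starts, finishes) if s < t_done)
        sum_latency += t_done
    return sum_latency / trials, sum_util / trials
```

With these toy parameters, eager forking (t_fork = 0, i.e., launching all redundant servers immediately) gives the lowest mean completion time but the highest mean utilization, while spawning no extras gives the opposite corner; intermediate forking points and spawn counts trace out the tradeoff between the two corners.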