Optimal Redundancy in Distributed Systems for Latency and Repair
Abstract
Distributed systems are employed in many modern storage and computing architectures for greater reliability and cost-efficiency. The design and implementation of such systems involve several important considerations, including latency, availability, and storage cost. We study the interplay between redundancy and two such parameters, latency and repair, in the design of these systems.
In distributed databases, data is replicated and stored redundantly over multiple servers for availability. In this thesis, we first study the impact of redundancy on latency in such databases. We focus on databases with frequent reads and writes, where both read and write latencies matter; this is in contrast to databases designed primarily for either read-heavy or write-heavy applications. Redundancy has contrasting effects on the two: read latency can be reduced through parallel access to multiple servers, whereas write latency increases because a larger number of replicas must be updated. We quantify this tradeoff between read and write latency as a function of redundancy and provide a closed-form approximation when request arrivals are Poisson and service is memoryless. We show empirically that this approximation is tight across all ranges of system parameters, and we thus provide guidelines for redundancy selection in distributed databases. Further, through simulations and experiments on practical systems, we demonstrate that an optimal redundancy exists even when service is non-memoryless.
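To make the read/write tension concrete, the following is a minimal Monte Carlo sketch. It assumes an illustrative model, not the thesis's exact queueing analysis: service times are i.i.d. exponential, a read completes when the fastest of r replicas responds, and a write completes only when all r replicas finish; queueing delay under Poisson arrivals is deliberately ignored. The function name and parameters are our own.

```python
import random

def read_write_latency(r, mu=1.0, trials=100_000):
    """Monte Carlo estimate of read and write latency for replication factor r.

    Illustrative model (an assumption, not the thesis's exact one):
    service times are i.i.d. Exponential(mu); a read returns when the
    fastest of the r replicas responds, a write only when all r finish.
    """
    read = write = 0.0
    for _ in range(trials):
        samples = [random.expovariate(mu) for _ in range(r)]
        read += min(samples)   # parallel read: first replica to answer
        write += max(samples)  # write: every replica must be updated
    return read / trials, write / trials

if __name__ == "__main__":
    for r in range(1, 6):
        rd, wr = read_write_latency(r)
        # E[min] = 1/(r*mu) shrinks with r, while E[max] = H_r/mu grows,
        # so a workload mixing reads and writes admits an optimal r.
        print(f"r={r}: read ~ {rd:.3f}, write ~ {wr:.3f}")
```

Even in this simplified model, the expected read latency decreases as 1/r while the expected write latency grows as the r-th harmonic number, so a workload that mixes reads and writes has a finite optimal replication factor.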
Second, we propose a redundancy scheme that introduces repair locality into distributed computation. In contrast to the previous model, we consider computations beyond simple reads and writes across the distributed system. In distributed computing, a job is split into multiple tasks that are executed over multiple nodes, and slow compute nodes, referred to as stragglers, become a bottleneck for the job completion time. For distributed computation of multivariate polynomials, Lagrange coded computing (LCC), proposed in the literature, tolerates an optimal number of stragglers while providing security against adversaries. Introducing repair locality in this setting reduces the number of worker nodes that must be contacted to recover a particular computation and allows clients or intermediate nodes to compute individual function outputs. We propose a distributed polynomial computing scheme in which data is encoded using Tamo-Barg codes with carefully chosen parameters, and we show that it tolerates a larger number of stragglers than the repeated LCC and Product Lagrange coded schemes for certain parameter regimes. Furthermore, we provide an alternative proof of the optimality of LCC for multilinear functions, based on properties of multivariate polynomial interpolation.
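To illustrate the encode/compute/decode flow that LCC-style schemes build on, here is a minimal Python sketch over a prime field. The field size, evaluation points, and the function f are illustrative choices of ours; f is a toy univariate quadratic standing in for a multivariate polynomial, and stragglers, adversaries, and repair locality are not modeled.

```python
P = 2_147_483_647  # a Mersenne prime; the field size is an illustrative choice

def lagrange_eval(xs, ys, z, p=P):
    """Evaluate at z the unique degree-(len(xs)-1) polynomial through
    the points (xs[i], ys[i]), working over GF(p)."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num = den = 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * ((z - xj) % p) % p
                den = den * ((xi - xj) % p) % p
        total = (total + yi * num * pow(den, p - 2, p)) % p  # Fermat inverse
    return total

def f(x):
    return (x * x) % P  # toy degree-2 polynomial to be computed on the data

# Encode k = 3 data points as the Lagrange polynomial u with u(alpha_i) = X_i;
# worker j receives u(beta_j) and returns f(u(beta_j)).
alphas = [1, 2, 3]
data = [10, 20, 30]
k, deg_f = len(alphas), 2
betas = list(range(10, 10 + deg_f * (k - 1) + 1))          # distinct from alphas
encoded = [lagrange_eval(alphas, data, b) for b in betas]   # u(beta_j)
worker_out = [f(c) for c in encoded]                        # f(u(beta_j))

# f(u(z)) is a polynomial of degree deg_f*(k-1), so deg_f*(k-1)+1 worker
# results suffice to interpolate it and read off f(X_i) = f(u(alpha_i)).
recovered = [lagrange_eval(betas, worker_out, a) for a in alphas]
assert recovered == [f(x) for x in data]
```

Since f(u(z)) has degree deg(f)(k-1), any deg(f)(k-1)+1 worker responses determine it completely; workers beyond that count can straggle, which is the source of LCC's straggler tolerance that the proposed Tamo-Barg-based scheme trades off against repair locality.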