Reliability Modelling Of Whole RAID Storage Subsystems
Abstract
Reliability modelling of RAID storage systems, with their various components such as RAID controllers, enclosures, expanders, interconnects and disks, is important from a storage system designer's point of view. A model that can express all the failure characteristics of the whole RAID storage system can be used to evaluate design choices, perform cost-reliability trade-offs and conduct sensitivity analyses.
We present a reliability model for RAID storage systems where we try to model all the components as accurately as possible. We use several state-space reduction techniques, such as aggregating all in-series components and hierarchical decomposition, to reduce the size of our model. To automate computation of reliability, we use the PRISM model checker as a CTMC solver where appropriate.
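As an illustration of why aggregating in-series components shrinks the state space: if independent components with exponential lifetimes are all required for operation, the time to the first failure is itself exponential, with a rate equal to the sum of the individual rates, so the whole chain can be treated as one component. The sketch below shows only this standard identity; the function names and the example rates are hypothetical and are not taken from the thesis.

```python
import math


def series_failure_rate(rates):
    """Aggregate failure rate of independent exponential components in series.

    The series system fails as soon as any one component fails, so the
    rates simply add.
    """
    return sum(rates)


def series_reliability(rates, t):
    """Probability that every component in the series chain survives to time t."""
    return math.exp(-series_failure_rate(rates) * t)


if __name__ == "__main__":
    # Hypothetical per-hour failure rates for three in-series components.
    example_rates = [1e-6, 2e-6, 3e-6]
    print(series_failure_rate(example_rates))
    print(series_reliability(example_rates, 1000.0))
```

The aggregated component replaces several states in the Markov model with one, which holds only under the exponential and independence assumptions stated above.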
Initially, we assume a simple 3-state disk reliability model with independent disk failures. Later, we assume a Weibull model for the disks; we also consider a correlated disk failure model to check correspondence with the available field data. For all other components in the system, we assume an exponential failure distribution. To use the CTMC solver, we approximate the Weibull distribution for a disk using a sum of exponentials, and we first confirm that this model gives results that are in reasonably good agreement with those from sequential Monte Carlo simulation methods for RAID disk subsystems.
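The abstract does not spell out the 3-state model. As a stand-in, the sketch below solves a textbook three-state chain for a single-parity group (healthy, degraded, data loss) with independent exponential disk failures and exponential rebuilds, which is the kind of small CTMC a solver would handle directly. The state definitions, the function name and the parameter values are assumptions made for illustration, not the author's model.

```python
def mttdl_three_state(n_disks, fail_rate, repair_rate):
    """Mean time to data loss for an assumed healthy -> degraded -> loss chain.

    healthy  -> degraded : rate a = n_disks * fail_rate
    degraded -> healthy  : rate repair_rate (rebuild completes)
    degraded -> loss     : rate b = (n_disks - 1) * fail_rate

    The expected absorption times satisfy
        T0 = 1/a + T1
        T1 = 1/(b + mu) + (mu / (b + mu)) * T0
    and eliminating T1 gives T0 = (a + b + mu) / (a * b).
    """
    a = n_disks * fail_rate
    b = (n_disks - 1) * fail_rate
    return (a + b + repair_rate) / (a * b)


if __name__ == "__main__":
    # Hypothetical values: 8 disks, failure rate 1e-6 per hour, rebuild rate 0.1 per hour.
    print(mttdl_three_state(8, 1e-6, 0.1))
```

With the repair rate set to zero the result reduces to the sum of the two mean sojourn times, 1/a + 1/b, which is a quick sanity check on the algebra.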
Next, our model for whole RAID storage systems (that includes, for example, disks, expanders, enclosures) uses Weibull distributions and, where appropriate, correlated failure modes for disks, and exponential distributions with independent failure modes for all other components. Since the CTMC solver cannot handle the size of the resulting models, we solve such models using a hierarchical decomposition technique. We are able to model fairly large configurations with up to 600 disks using this model.
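The general idea of hierarchical decomposition can be sketched as follows: each sub-model is solved on its own, and only its resulting reliability is passed up to a small top-level model. The example assumes a hypothetical layout in which every RAID group sits in its own enclosure, all groups must survive, and sub-systems fail independently; the thesis's actual decomposition may differ, and the function names and numbers are illustrative only.

```python
import math


def series(reliabilities):
    """Reliability of independent sub-systems that must all survive."""
    return math.prod(reliabilities)


def system_reliability(group_rel, enclosure_rel, n_groups):
    """Top-level model under the assumed layout.

    group_rel and enclosure_rel are reliabilities at a fixed mission time,
    obtained separately from lower-level models (for example, a CTMC per
    RAID group). Each group is in series with its own enclosure, and the
    system survives only if every such shelf survives.
    """
    per_shelf = series([group_rel, enclosure_rel])
    return per_shelf ** n_groups


if __name__ == "__main__":
    # Hypothetical lower-level results for one mission time.
    print(system_reliability(group_rel=0.99, enclosure_rel=0.999, n_groups=3))
```

The top-level model here has no state space of its own; the cost of solving the system grows with the number of sub-models rather than with the product of their state counts.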
We can use such reasonably complete models to conduct several "what-if" analyses for many RAID storage systems of interest. Our results show that, depending on the configuration, spanning a RAID group across enclosures may increase or decrease reliability. Another key finding from our model results is that redundancy mechanisms such as multipathing are beneficial only if a single failure of some other component does not cause data inaccessibility of a whole RAID group.