End-to-end Resiliency Analysis Framework for Cloud Storage Services
Author
Ghosh, Archita
Abstract
Cloud storage services introduced the idea of a global-scale storage system that is available on demand and accessible from anywhere. Despite the benefits, resiliency remains one of the key issues hindering the wide adoption of storage services. The data is hosted in cloud data centers containing hundreds of thousands of commodity-grade hardware components running layers of complex software. Failures due to system crashes, natural disasters, cyber-attacks, etc., are frequent in such environments. To keep the service unaffected by such events, resiliency is essential for cloud systems. For storage services, resiliency is far more critical because losing access to data or, more importantly, losing the data entirely can have a catastrophic impact on the client.
Existing work on storage resiliency focuses on maintaining sufficient redundancy of user data to keep the service reliable. However, providing a global-scale storage solution requires various functional and management layers to ensure that the service is accessible and all stored items are durable. The first part of our work proves that resiliency at the stored data level does not guarantee service-level reliability. A generic cloud storage system model is designed to show analytically that the reliability achieved at the service level differs drastically from the reliability ensured by stored data redundancy. This motivates us to bring the entire system into purview to understand cloud storage resiliency.
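The gap between data-level and service-level reliability can be illustrated with a minimal sketch. It is not the model developed in this work; it assumes independent failures, a simple series composition of layers, and made-up reliability values, and the layer names (`proxy`, `metadata`, `network`) are hypothetical.

```python
from math import prod


def replicated_reliability(r: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent copies survives."""
    return 1.0 - (1.0 - r) ** replicas


def series_reliability(layer_reliabilities) -> float:
    """A request succeeds only if every layer on its path works (series system)."""
    return prod(layer_reliabilities)


# Hypothetical values chosen for illustration only.
disk = 0.99
data = replicated_reliability(disk, 3)  # three-way replication of user data

layers = {
    "proxy": 0.999,      # assumed request-routing layer
    "metadata": 0.995,   # assumed metadata/management layer
    "network": 0.999,    # assumed interconnect
    "data": data,        # replicated user data
}
service = series_reliability(layers.values())

print(f"data-level reliability:    {data:.6f}")
print(f"service-level reliability: {service:.6f}")
```

Under these assumptions the replicated data is far more reliable than the service as a whole, because the non-replicated layers on the request path dominate the product.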
Due to the complexity and variation of large-scale storage architectures, assessing end-to-end storage resiliency is a challenging task. To address this, the second part of the work proposes a generic resiliency evaluation method for cloud storage services. The method identifies the functional layers essential to a storage service and the components constituting each layer. It then performs an in-depth analysis of system behavior under all possible failures of each component. The method is used to assess the resiliency of two diverse, real-world cloud storage services, OpenStack Swift and CephFS. The analysis identifies various resiliency weak points in the service architectures and demonstrates the effectiveness of the different resiliency methods used at various layers.
The third part of the work extends the resiliency evaluation method to understand how resiliency correlates with the service usage pattern. A storage service can serve different use cases, which leads to variation in request interarrival time, read-to-write ratio, accessed data and metadata, etc. Hence, the components involved in access sequences may differ, and so can the impact of their failures. Using the improved resiliency evaluation method and access patterns identified from real traces, we show that resiliency can be applied selectively and adjusted dynamically based on workloads without affecting service reliability.
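The dependence of failure impact on the workload can be sketched as follows. This is an illustrative toy, not the evaluation method of this work: the component names, availability values, and the assumption that reads bypass the metadata component are all hypothetical, and failures are assumed independent.

```python
from math import prod

# Hypothetical per-component availabilities.
AVAILABILITY = {"proxy": 0.999, "metadata": 0.99, "storage": 0.9999}

# Assumed access sequences: which components each request type touches.
PATHS = {
    "read": ["proxy", "storage"],
    "write": ["proxy", "metadata", "storage"],
}


def request_success(kind: str) -> float:
    """Probability that every component on the request's path is available."""
    return prod(AVAILABILITY[c] for c in PATHS[kind])


def workload_success(mix: dict) -> float:
    """Success probability averaged over the request mix (shares sum to 1)."""
    return sum(share * request_success(kind) for kind, share in mix.items())


read_heavy = workload_success({"read": 0.9, "write": 0.1})
write_heavy = workload_success({"read": 0.1, "write": 0.9})

print(f"read-heavy workload:  {read_heavy:.6f}")
print(f"write-heavy workload: {write_heavy:.6f}")
```

In this toy setting the same metadata failure rate hurts the write-heavy workload much more than the read-heavy one, which is the kind of asymmetry that makes workload-aware resiliency adjustment plausible.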
Finally, the work defines an end-to-end resiliency analysis framework for cloud storage services that enables quantification, comparison, and optimization of cloud storage resiliency. The framework allows effective modeling of cloud storage resilience by combining the resiliency of each component participating in service reliability maintenance for specific workloads. The framework successfully models the resiliency of OpenStack Swift and CephFS as Stochastic Petri Nets (SPNs). The models are used to quantify and compare the resiliency of the above two service architectures and demonstrate how to optimize resiliency while achieving expected service reliability.