Adaptive Fault Tolerance Strategies for Large Scale Systems

George, Cijo

dc.contributor.advisor	Vadhiyar, Sathish
dc.contributor.author	George, Cijo
dc.date.accessioned	2018-03-07T14:04:50Z
dc.date.accessioned	2018-07-31T05:09:16Z
dc.date.available	2018-03-07T14:04:50Z
dc.date.available	2018-07-31T05:09:16Z
dc.date.issued	2018-03-07
dc.date.submitted	2012
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/3240
dc.identifier.abstract	https://etd.iisc.ac.in/static/etd/abstracts/4101/G25573-Abs.pdf	en_US
dc.description.abstract	Exascale systems of the future are predicted to have mean time between node failures (MTBF) of less than one hour. At such low MTBF, the number of processors available for execution of a long running application can vary widely throughout the execution of the application. Employing traditional fault tolerance strategies like periodic checkpointing in these highly dynamic environments may not be effective because of the high number of application failures, which result in a large amount of work lost due to rollbacks in addition to increased recovery overheads. In this context, it is necessary to have fault tolerance strategies that can adapt to the changing node availability and also help avoid a significant number of application failures. In this thesis, we present two adaptive fault tolerance strategies that make use of node failure prediction mechanisms to provide proactive fault tolerance for long running parallel applications on large scale systems. The first part of the thesis deals with an adaptive fault tolerance strategy for malleable applications. We present ADFT, an adaptive fault tolerance framework for long running malleable applications to maximize application performance in the presence of failures. We first develop cost models that consider different factors, like accuracy of node failure predictions and application scalability, for evaluating the benefits of various fault tolerance actions including checkpointing, live migration and rescheduling. Our adaptive framework then uses the cost models to make runtime decisions for dynamically selecting the fault tolerance actions at different points of application execution to minimize application failures and maximize performance.
Simulations with real and synthetic failure traces show that our approach outperforms existing fault tolerance mechanisms for malleable applications, yielding up to 23% improvement in work done by the application in the presence of failures, and is effective even for petascale and exascale systems. In the second part of the thesis, we present a fault tolerance strategy using adaptive process replication that can provide fault tolerance for applications using partial replication of a set of application processes. This fault tolerance framework adaptively changes the set of replicated processes (replicated set) periodically based on node failure predictions to avoid application failures. We have developed an MPI prototype implementation, PAREP-MPI, that allows dynamically changing the replicated set of processes for MPI applications. Experiments with real scientific applications on real systems have shown that the overhead of PAREP-MPI is minimal. We have shown using simulations with real and synthetic failure traces that our strategy involving adaptive process replication significantly outperforms existing mechanisms, providing up to 20% improvement in application efficiency even for exascale systems. Significant observations are also made which can drive future research efforts in fault tolerance for large and very large scale systems.	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	G25573	en_US
dc.subject	Fault-tolerant Computing	en_US
dc.subject	Large Scale Systems	en_US
dc.subject	Adaptive Fault Tolerance	en_US
dc.subject	Adaptive Process Replication	en_US
dc.subject	Large Scale Systems - Fault Tolerance	en_US
dc.subject	Malleability and Rescheduling	en_US
dc.subject	Large Scale Parallel Systems	en_US
dc.subject	Proactive Fault Tolerance	en_US
dc.subject	High Performance Computing	en_US
dc.subject	Adaptive Fault Management	en_US
dc.subject	Parallel Processing (Electronic Computers)	en_US
dc.subject.classification	Computer Science	en_US
dc.title	Adaptive Fault Tolerance Strategies for Large Scale Systems	en_US
dc.type	Thesis	en_US
dc.degree.name	MSc Engg	en_US
dc.degree.level	Masters	en_US
dc.degree.discipline	Faculty of Engineering	en_US