Show simple item record

dc.contributor.advisorVadhiyar, Sathish
dc.contributor.authorGeorge, Cijo
dc.date.accessioned2018-03-07T14:04:50Z
dc.date.accessioned2018-07-31T05:09:16Z
dc.date.available2018-03-07T14:04:50Z
dc.date.available2018-07-31T05:09:16Z
dc.date.issued2018-03-07
dc.date.submitted2012
dc.identifier.urihttps://etd.iisc.ac.in/handle/2005/3240
dc.identifier.abstracthttp://etd.iisc.ac.in/static/etd/abstracts/4101/G25573-Abs.pdfen_US
dc.description.abstractExascale systems of the future are predicted to have a mean time between node failures (MTBF) of less than one hour. At such a low MTBF, the number of processors available to a long-running application can vary widely over the course of its execution. Traditional fault tolerance strategies such as periodic checkpointing may not be effective in these highly dynamic environments: the high number of application failures results in a large amount of work lost to rollbacks, in addition to increased recovery overheads. In this context, fault tolerance strategies are needed that can adapt to changing node availability and also help avoid a significant number of application failures. In this thesis, we present two adaptive fault tolerance strategies that use node failure prediction mechanisms to provide proactive fault tolerance for long-running parallel applications on large scale systems. The first part of the thesis deals with an adaptive fault tolerance strategy for malleable applications. We present ADFT, an adaptive fault tolerance framework for long-running malleable applications that maximizes application performance in the presence of failures. We first develop cost models that consider factors such as the accuracy of node failure predictions and application scalability, in order to evaluate the benefits of various fault tolerance actions including checkpointing, live migration and rescheduling. Our adaptive framework then uses the cost models to make runtime decisions, dynamically selecting fault tolerance actions at different points of application execution to minimize application failures and maximize performance. 
Simulations with real and synthetic failure traces show that our approach outperforms existing fault tolerance mechanisms for malleable applications, yielding up to 23% improvement in work done by the application in the presence of failures, and is effective even for petascale and exascale systems. In the second part of the thesis, we present a fault tolerance strategy based on adaptive process replication, which provides fault tolerance by partially replicating a set of application processes. This framework periodically changes the set of replicated processes (the replicated set) based on node failure predictions to avoid application failures. We have developed a prototype MPI implementation, PAREP-MPI, that allows the replicated set of processes to be changed dynamically for MPI applications. Experiments with real scientific applications on real systems show that the overhead of PAREP-MPI is minimal. Simulations with real and synthetic failure traces show that our adaptive process replication strategy significantly outperforms existing mechanisms, providing up to 20% improvement in application efficiency even for exascale systems. We also report significant observations that can drive future research in fault tolerance for large and very large scale systems.en_US
dc.language.isoen_USen_US
dc.relation.ispartofseriesG25573en_US
dc.subjectFault-tolerant Computingen_US
dc.subjectLarge Scale Systemsen_US
dc.subjectAdaptive Fault Toleranceen_US
dc.subjectAdaptive Process Replicationen_US
dc.subjectLarge Scale Systems - Fault Toleranceen_US
dc.subjectMalleability and Reschedulingen_US
dc.subjectLarge Scale Parallel Systemsen_US
dc.subjectProactive Fault Toleranceen_US
dc.subjectHigh Performance Computingen_US
dc.subjectAdaptive Fault Managementen_US
dc.subjectParallel Processing (Electronic Computers)en_US
dc.subject.classificationComputer Scienceen_US
dc.titleAdaptive Fault Tolerance Strategies for Large Scale Systemsen_US
dc.typeThesisen_US
dc.degree.nameMSc Enggen_US
dc.degree.levelMastersen_US
dc.degree.disciplineFaculty of Engineeringen_US

