Recovery patterns for iterative methods in a parallel unstable environment

J. Langou, Z. Chen, G. Bosilca, J. Dongarra

    Research output: Contribution to journal › Article › peer-review

    Abstract

    Several recovery techniques for parallel iterative methods are presented. First, the implementation of checkpoints in parallel iterative methods is described and analyzed. Then a simple checkpoint-free fault-tolerant scheme for parallel iterative methods, the lossy approach, is presented. When one processor fails and all its data is lost, the system is recovered by computing a new approximate solution using the data of the nonfailed processors. The iterative method is then restarted with this new vector. The main advantage of the lossy approach over standard checkpoint algorithms is that it does not increase the computational cost of the iterative solver when no failure occurs. Experiments are presented that compare the different techniques. The fault-tolerant FT-MPI library is used. Both iterative linear solvers and eigensolvers are considered. © 2007 Society for Industrial and Applied Mathematics.
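The abstract says only that the lost data is replaced by "a new approximate solution" computed from the surviving processors. It does not give the formula. The sketch below illustrates one plausible reading, stated here as an assumption: the lost block of the iterate is rebuilt by solving the failed processor's own rows of the linear system, with the surviving entries held fixed. The matrix, the two-"processor" row partition, the Jacobi solver, and every function name are hypothetical and chosen for illustration. No FT-MPI calls are used, and the failure is simulated in a single process.

```python
def solve_dense(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - tail) / aug[i][i]
    return y


def jacobi_step(A, b, x):
    """One Jacobi sweep: each entry is updated from the previous iterate."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def residual_norm(A, b, x):
    """Infinity norm of b - A x."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n))


def lossy_recover(A, b, x, lost):
    """Rebuild the lost entries of x from the surviving ones (assumed scheme).

    Solves A[lost, lost] * y = b[lost] - A[lost, survivors] * x[survivors].
    Entries of x at the lost indices are never read.
    """
    survivors = [j for j in range(len(b)) if j not in lost]
    local = [[A[i][j] for j in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in survivors) for i in lost]
    y = solve_dense(local, rhs)
    x_new = list(x)
    for k, i in enumerate(lost):
        x_new[i] = y[k]
    return x_new


# Hypothetical diagonally dominant system whose exact solution is [1, 1, 1, 1].
A = [[10.0, 1.0, 2.0, 0.0],
     [1.0, 12.0, 0.0, 3.0],
     [2.0, 0.0, 9.0, 1.0],
     [0.0, 3.0, 1.0, 11.0]]
b = [13.0, 16.0, 12.0, 15.0]

x = [0.0] * 4
for _ in range(5):              # iterate normally before the failure
    x = jacobi_step(A, b, x)

lost = [2, 3]                   # rows owned by the simulated failed processor
for i in lost:
    x[i] = None                 # its part of the iterate is gone

x_recovered = lossy_recover(A, b, x, lost)

x_final = x_recovered
for _ in range(30):             # restart the iteration from the recovered vector
    x_final = jacobi_step(A, b, x_final)

print(residual_norm(A, b, x_final))
```

In this sketch nothing is saved during failure-free iterations, which matches the advantage the abstract claims for the lossy approach. The cost appears only at recovery time, as one local solve plus whatever extra iterations the restart needs.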
    Original language: English
    Pages (from-to): 102-116
    Number of pages: 14
    Journal: SIAM Journal on Scientific Computing
    Volume: 30
    Issue number: 1
    Publication status: Published - 2007

    Keywords

    • Fault-tolerant algorithms
    • Iterative methods
    • Parallel distributed
