A scalable checkpoint encoding algorithm for diskless checkpointing

Zizhong Chen, Jack Dongarra

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Diskless checkpointing is an efficient technique for saving the state of a long-running application in a distributed environment without relying on stable storage. In this paper, we introduce several scalable encoding strategies into diskless checkpointing and reduce the overhead to survive k failures in p processes from 2⌈log p⌉·k((β + 2γ)m + α) to (1 + O(1/√m))·k(β + 2γ)m, where α is the communication latency, 1/β is the network bandwidth between processes, 1/γ is the rate to perform calculations, and m is the size of the local checkpoint per process. The introduced algorithm is scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. We evaluate the performance overhead of the introduced algorithm by using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that the introduced techniques are highly scalable. © 2008 IEEE.
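To make the comparison in the abstract concrete, the two overhead expressions can be evaluated side by side. The following is only an illustrative sketch, not code from the paper: the function names are hypothetical, the logarithm is assumed to be base 2, and the constant `c` is a made-up stand-in for the factor hidden inside the O(1/√m) term, which the abstract does not specify.

```python
import math


def tree_encoding_overhead(p, k, m, alpha, beta, gamma):
    """Baseline overhead from the abstract:
    2 * ceil(log p) * k * ((beta + 2*gamma) * m + alpha).
    Assumption: log is base 2."""
    return 2 * math.ceil(math.log2(p)) * k * ((beta + 2 * gamma) * m + alpha)


def scalable_encoding_overhead(k, m, beta, gamma, c=1.0):
    """Improved overhead from the abstract:
    (1 + O(1/sqrt(m))) * k * (beta + 2*gamma) * m.
    Assumption: c is a hypothetical constant replacing the O(.) factor."""
    return (1 + c / math.sqrt(m)) * k * (beta + 2 * gamma) * m


if __name__ == "__main__":
    # Illustrative parameter values only; none of these come from the paper.
    k, m = 1, 1e6                            # failures tolerated, checkpoint size
    alpha, beta, gamma = 1e-5, 1e-8, 1e-9    # latency, 1/bandwidth, 1/compute rate
    for p in (64, 1024, 16384):
        base = tree_encoding_overhead(p, k, m, alpha, beta, gamma)
        new = scalable_encoding_overhead(k, m, beta, gamma)
        print(f"p={p:6d}  baseline={base:.4f}  scalable={new:.4f}")
```

The second function takes no `p` argument at all, which mirrors the abstract's claim that the overhead does not grow with the number of processes, while the baseline grows with ⌈log p⌉.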
    Original language: English
    Title of host publication: Proceedings of IEEE International Symposium on High Assurance Systems Engineering (Proc. IEEE Int. Symp. High Assur. Syst. Eng.)
    Publisher: IEEE Computer Society
    Pages: 71-79
    Number of pages: 8
    ISBN (Print): 9780769534824
    Publication status: Published - 2008
    Event: 11th IEEE High Assurance Systems Engineering Symposium, HASE 2008 - Nanjing
    Duration: 1 Jul 2008 → …
    http://dblp.uni-trier.de/db/conf/hase/hase2008.html#ChenD08
    http://dblp.uni-trier.de/rec/bibtex/conf/hase/ChenD08.xml
    http://dblp.uni-trier.de/rec/bibtex/conf/hase/ChenD08

    Conference

    Conference: 11th IEEE High Assurance Systems Engineering Symposium, HASE 2008
    City: Nanjing
    Period: 1/07/08 → …

    Keywords

    • Checkpoint
    • Diskless checkpointing
    • Fault tolerance
    • High performance computing
    • Parallel and distributed systems
    • Reed-Solomon encoding
