Fault tolerant high performance computing by a coding approach

Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca, Jack Dongarra

    Research output: Chapter in Book/Conference proceeding > Conference contribution

    Abstract

    As the number of processors in today's high performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high performance computing applications cannot survive node failures and, therefore, whenever a node fails, have to abort and restart either from the beginning or from a stable-storage-based checkpoint. This paper explores the use of the floating-point arithmetic coding approach to build fault survivable high performance computing applications that can adapt to node failures without aborting. Although the use of erasure codes over Galois fields has been proposed before in theory for diskless checkpointing, few actual implementations exist. This probably derives from concerns about both the efficiency and the complexity of implementing such codes in high performance computing applications. In this paper, we introduce the simple but efficient floating-point arithmetic coding approach into diskless checkpointing and address the associated round-off error issue. We also implement a floating-point arithmetic version of the Reed-Solomon coding scheme in a conjugate gradient equation solver and evaluate both the performance and the numerical impact of this scheme. Experimental results demonstrate that the proposed floating-point arithmetic coding approach is able to survive a small number of simultaneous node failures with low performance overhead and little numerical impact. Copyright 2005 ACM.
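The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that each of k processes holds a vector of floats, that m weighted checksum vectors are kept elsewhere, and that the weights are Vandermonde-style, with row i and column p holding (p + 1) ** i. The names `make_weights`, `encode`, `solve`, and `recover` are hypothetical and chosen only for this example. Up to m lost vectors are rebuilt by solving a small linear system in ordinary floating-point arithmetic, which is where the round-off error mentioned in the abstract enters.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def make_weights(m: int, k: int) -> List[Vector]:
    """Assumed weights: row i, column p holds (p + 1) ** i (Vandermonde-style)."""
    return [[float(p + 1) ** i for p in range(k)] for i in range(m)]


def encode(data: Sequence[Vector], weights: Sequence[Vector]) -> List[Vector]:
    """Compute one weighted floating-point checksum vector per weight row."""
    n = len(data[0])
    return [[sum(w * vec[j] for w, vec in zip(row, data)) for j in range(n)]
            for row in weights]


def solve(a: List[Vector], b: Vector) -> Vector:
    """Solve the small dense system a x = b (Gaussian elimination, partial pivoting)."""
    size = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (aug[r][size] - tail) / aug[r][r]
    return x


def recover(data: Sequence[Optional[Vector]],
            checksums: Sequence[Vector],
            weights: Sequence[Vector]) -> List[Vector]:
    """Rebuild the vectors marked None from the survivors and the checksums."""
    failed = [p for p, vec in enumerate(data) if vec is None]
    if len(failed) > len(checksums):
        raise ValueError("more failures than checksums")
    n = len(checksums[0])
    rows = range(len(failed))  # one checksum equation per lost vector
    a = [[weights[i][p] for p in failed] for i in rows]
    restored = [list(vec) if vec is not None else [0.0] * n for vec in data]
    for j in range(n):
        # Subtract the surviving contributions, then solve for the lost entries.
        rhs = [checksums[i][j] - sum(weights[i][p] * vec[j]
                                     for p, vec in enumerate(data)
                                     if vec is not None)
               for i in rows]
        for p, value in zip(failed, solve(a, rhs)):
            restored[p][j] = value
    return restored


# Usage: four "processes", two checksums, two simultaneous failures.
original = [[1.5, -2.0, 3.25], [0.1, 0.2, 0.3],
            [4.0, 5.0, 6.0], [-7.5, 8.0, 9.125]]
weights = make_weights(2, len(original))
checksums = encode(original, weights)
damaged = [original[0], None, original[2], None]
restored = recover(damaged, checksums, weights)
max_error = max(abs(x - y)
                for r, o in zip(restored, original) for x, y in zip(r, o))
```

In this sketch the recovered values match the originals only up to round-off, so `max_error` is small but not necessarily zero. The Vandermonde-style weights are used here only because any square selection of the first rows with distinct columns is invertible; such matrices can become ill-conditioned as k grows, so a real implementation would need to choose its weights with care.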
    Original language: English
    Title of host publication: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP | Proc ACM SIGPLAN Symp Prins Pract Parall Program PPOPP
    Publisher: Association for Computing Machinery
    Pages: 213-223
    Number of pages: 10
    Publication status: Published - 2005
    Event: 2005 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 05 - Chicago, IL
    Duration: 1 Jul 2005 → …
    http://dblp.uni-trier.de/db/conf/ppopp/ppopp2005.html#ChenFGLABD05
    http://dblp.uni-trier.de/rec/bibtex/conf/ppopp/ChenFGLABD05.xml
    http://dblp.uni-trier.de/rec/bibtex/conf/ppopp/ChenFGLABD05

    Conference

    Conference: 2005 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 05
    City: Chicago, IL
    Period: 1/07/05 → …
    Keywords

    • Fault Tolerance
    • Floating-Point Arithmetic Coding
    • High Performance Computing
    • Message Passing Interface
