HARNESS and fault tolerant MPI

Graham E. Fagg, Antonin Bukovsky, Jack J. Dongarra

    Research output: Contribution to journal, article, peer-reviewed

    Abstract

    Initial versions of MPI were designed to work efficiently on multi-processors that had very little job control and thus static process models. Forcing them to support a dynamic process model afterwards would have affected their performance. As current HPC systems increase in size, with greater potential levels of individual node failure, the need arises for new fault tolerant systems to be developed. Here we present a new implementation of MPI called fault tolerant MPI (FT-MPI) that allows the semantics and associated modes of failures to be explicitly controlled by an application via a modified MPI API. We give an overview of the FT-MPI semantics, design, example applications, debugging tools and some performance issues. We also discuss the experimental HARNESS core (G_HCORE) implementation that FT-MPI is built to operate upon. © 2001 Elsevier Science B.V. All rights reserved.
    Original language: English
    Pages (from-to): 1479-1495
    Number of pages: 16
    Journal: Parallel Computing
    Volume: 27
    Issue number: 11
    Publication status: Published, Oct 2001

    Keywords

    • Fault tolerant application
    • Message passing
    • Metacomputing middleware
    • Parallel computing
