SRS: A framework for developing malleable and migratable parallel applications for distributed systems

Sathish S. Vadhiyar, Jack J. Dongarra

    Research output: Contribution to journal › Article › peer-review

    Abstract

    Malleable parallel applications, which can be stopped and reconfigured during execution, offer attractive benefits for both the system and the applications. Reconfiguration can mean varying the parallelism of an application, changing its data distribution during execution, or dynamically changing the software components involved in the execution. In distributed and Grid computing systems, migrating and reconfiguring such malleable applications across distributed heterogeneous sites that do not share common file systems provides flexibility for scheduling and resource management. Existing reconfiguration systems do not support migration of parallel applications to distributed locations. In this paper, we discuss a framework for developing malleable and migratable MPI message-passing parallel applications for distributed systems. The framework includes a user-level checkpointing library called SRS and a runtime support system that manages the checkpointed data for distribution to distributed locations. Our experiments indicate that parallel applications instrumented with the SRS library were able to achieve reconfigurability while incurring about 15-35% overhead.
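The kind of reconfiguration the abstract describes, checkpointing a distributed data set and restarting on a different number of processes, can be illustrated with a minimal sketch. This is not the SRS API: the function names (`block_ranges`, `checkpoint`, `restart`) are hypothetical, processes are simulated as Python lists rather than MPI ranks, and a simple contiguous block distribution is assumed.

```python
def block_ranges(n, nprocs):
    """Split indices 0..n-1 into nprocs contiguous blocks whose sizes differ by at most 1."""
    base, extra = divmod(n, nprocs)
    ranges = []
    start = 0
    for rank in range(nprocs):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def checkpoint(local_blocks):
    """Concatenate per-process blocks into one global snapshot.

    Stand-in (assumption) for checkpointed data held by a runtime support system.
    """
    snapshot = []
    for block in local_blocks:
        snapshot.extend(block)
    return snapshot


def restart(snapshot, new_nprocs):
    """Redistribute a global snapshot over a new number of simulated processes."""
    return [snapshot[lo:hi] for lo, hi in block_ranges(len(snapshot), new_nprocs)]


data = list(range(10))
old = restart(data, 4)   # initial block distribution over 4 simulated processes
snap = checkpoint(old)   # application is stopped and its data is saved
new = restart(snap, 3)   # application continues on 3 simulated processes
```

In this sketch the snapshot is independent of the process count, which is what allows the restart to use a different degree of parallelism than the run that produced the checkpoint.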
    Original language: English
    Pages (from-to): 291-312
    Number of pages: 21
    Journal: Parallel Processing Letters
    Volume: 13
    Issue number: 2
    Publication status: Published - Jun 2003

    Keywords

    • Checkpointing
    • Distributed
    • Malleable
    • Migration
    • MPI
    • Parallel
    • Reconfiguration
