Reliability analysis of self-healing network using discrete-event simulation

Thara Angskun, George Bosilca, Graham Fagg, Jelena Pješivac-Grbović, Jack J. Dongarra

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    The number of processors embedded in high-performance computing platforms is continuously increasing to accommodate users' desire to solve larger and more complex problems. However, as the number of components increases, so does the probability of failure. Thus, both the scalability and the fault tolerance of software are important issues in this field. To ensure the reliability of software, especially under failure conditions, reliability analysis is needed. Discrete-event simulation offers an attractive alternative to traditional Markovian analytical models, which often have an intractably large state space. In this paper, we analyze the reliability of a self-healing network developed for parallel runtime environments using discrete-event simulation. The network is designed to support transmission of messages across multiple nodes and, at the same time, to protect against node and process failures. Results demonstrate the flexibility of a discrete-event simulation approach for studying network behavior under failure conditions and with various protocol parameters, message types, and routing algorithms. © 2007 IEEE.
    Original language: English
    Title of host publication: Proceedings - Seventh IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2007
    Publisher: IEEE Computer Society
    Pages: 437-444
    Number of pages: 7
    ISBN (Print): 0769528333, 9780769528335
    Publication status: Published - 2007
    Event: 7th IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2007 - Rio de Janeiro
    Duration: 1 Jul 2007 → …

    Conference

    Conference: 7th IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2007
    City: Rio de Janeiro
    Period: 1/07/07 → …
