TY - JOUR
T1 - Fault-tolerant matrix operations for networks of workstations using diskless checkpointing
AU - Plank, James S.
AU - Kim, Youngbae
AU - Dongarra, Jack J.
PY - 1997/6/15
Y1 - 1997/6/15
N2 - Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load, or availability. As long as there are at leastnprocessors in the cluster, and failures occur singly, the computation will complete in an efficient manner. We discuss the details of how the algorithms are tuned for fault-tolerance and present the performance results on a PVM network of Sun workstations connected by a fast, switched ethernet. © 1997 Academic Press.
AB - Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load, or availability. As long as there are at leastnprocessors in the cluster, and failures occur singly, the computation will complete in an efficient manner. We discuss the details of how the algorithms are tuned for fault-tolerance and present the performance results on a PVM network of Sun workstations connected by a fast, switched ethernet. © 1997 Academic Press.
U2 - 10.1006/jpdc.1997.1336
DO - 10.1006/jpdc.1997.1336
M3 - Article
SN - 0743-7315
VL - 43
SP - 125
EP - 138
JO - Journal of Parallel and Distributed Computing
JF - Journal of Parallel and Distributed Computing
IS - 2
ER -