TY - JOUR
T1 - Unified model for assessing checkpointing protocols at extreme-scale
AU - Bosilca, George
AU - Bouteiller, Aurélien
AU - Brunet, Elisabeth
AU - Cappello, Franck
AU - Dongarra, Jack
AU - Guermouche, Amina
AU - Herault, Thomas
AU - Robert, Yves
AU - Vivien, Frédéric
AU - Zaidouni, Dounia
PY - 2014/11/4
Y1 - 2014/11/4
N2 - In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault-tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high-performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.
AB - In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault-tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high-performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.
KW - Checkpoint/restart
KW - Checkpointing waste optimization problem
KW - Coordinated checkpoint
KW - Hierarchical checkpoint with message logging
UR - http://www.scopus.com/inward/record.url?scp=84907933728&partnerID=8YFLogxK
U2 - 10.1002/cpe.3173
DO - 10.1002/cpe.3173
M3 - Article
AN - SCOPUS:84907933728
SN - 1532-0626
VL - 26
SP - 2772
EP - 2791
JO - Concurrency and Computation: Practice & Experience
JF - Concurrency and Computation: Practice & Experience
IS - 17
ER -