Prediction-based failure management for supercomputers

  • Wuxiang Ge

Student thesis: PhD


The growing requirements of a wide range of applications necessitate the deployment of large and powerful computing systems, and failures in these systems may cause severe damage, from loss of human life to harm to the world economy. However, current fault tolerance techniques cannot meet the increasing requirements for reliability. New solutions are therefore urgently needed, and research on proactive schemes is one direction that may offer better efficiency. This thesis proposes a novel proactive failure management framework. Its goal is to reduce failure penalties and improve fault tolerance efficiency in supercomputers running complex applications. The proposed proactive scheme builds on two core components: failure prediction and proactive failure recovery. The failure prediction component is based on the assessment of system events and employs semi-Markov models to capture the dependencies between failures and other events in order to forecast forthcoming failures. Furthermore, a two-level failure prediction strategy is described that not only estimates whether a failure will occur but also identifies the specific failure category. Based on this failure forecasting, a prediction-based coordinated checkpoint mechanism is designed to take extra checkpoints just before each predicted failure so that the wasted computational time can be significantly reduced. Moreover, a theoretical model has been developed to assess the proactive scheme; it enables calculation of the overall wasted computational time. The prediction component has been applied to industrial data from the IBM BlueGene/L system.
Results show that the semi-Markov-based predictor, which achieves a precision of 87.41% and a recall of 77.95%, substantially improves prediction accuracy in comparison with three other well-known prediction approaches.
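For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how precision and recall are conventionally computed for a failure predictor. The function name and the counts in the usage lines are hypothetical and chosen only for illustration; they are not taken from the thesis or from the BlueGene/L data.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for a failure predictor.

    true_positives:  predicted failures that actually occurred
    false_positives: predicted failures that did not occur (false alarms)
    false_negatives: failures that occurred without being predicted
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    # Guard against division by zero when nothing was predicted or nothing failed.
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
precision, recall = precision_recall(true_positives=6,
                                     false_positives=2,
                                     false_negatives=4)
print(f"precision={precision:.2%}, recall={recall:.2%}")
```

Precision measures how many of the raised failure warnings were correct, while recall measures how many of the real failures were caught; a proactive checkpoint scheme benefits from both, since false alarms cost unnecessary checkpoints and missed failures cost recomputation.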
Date of Award: 31 Dec 2011
Original language: English
Awarding Institution
  • The University of Manchester
Supervisor: John Gurd (Supervisor) & John Keane (Supervisor)


Keywords
  • Failure prediction
  • Semi-Markov conditional random field
