Outage prediction and diagnosis for cloud service systems

Yujun Chen, Xian Yang, Qingwei Lin, Hongyu Zhang, Feng Gao, Zhangwei Xu, Yingnong Dang, Dongmei Zhang, Hang Dong, Yong Xu, Hao Li, Kang Yu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

With the rapid growth of cloud service systems and their increasing complexity, service failures have become unavoidable. Outages, which are critical service failures, can dramatically degrade system availability and impact user experience. To minimize service downtime and ensure high system availability, we develop an intelligent outage management approach, called AirAlert, which can forecast outages before they happen and diagnose their root causes after they occur. AirAlert works as a global watcher for the entire cloud system: it collects all alerting signals, detects dependencies among signals, and proactively predicts outages that may happen anywhere in the cloud system. We analyze the relationships between outages and alerting signals using a Bayesian network and predict outages using a robust classification method based on gradient boosting trees. The proposed approach is evaluated on an outage dataset collected from a Microsoft cloud system, and the results confirm its effectiveness.
Original language: English
Title of host publication: WWW '19: The World Wide Web Conference
Subtitle of host publication: 2019 Proceedings
Editors: Ling Liu, Ryen White
Place of Publication: New York
Publisher: Association for Computing Machinery
Pages: 2659-2665
Number of pages: 7
ISBN (Print): 9781450366748
Publication status: Published - 13 May 2019

Keywords

  • Outage prediction
  • outage diagnosis
  • cloud system
  • system of systems
  • service availability
