{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,5]],"date-time":"2026-02-05T06:31:34Z","timestamp":1770273094391,"version":"3.49.0"},"reference-count":30,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2022,6,14]],"date-time":"2022-06-14T00:00:00Z","timestamp":1655164800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGOPS Oper. Syst. Rev."],"published-print":{"date-parts":[[2022,6,14]]},"abstract":"<jats:p>Cloud incidents (service interruptions or performance degradation) dramatically degrade the reliability of large-scale cloud systems, causing customer dissatisfaction and revenue loss. With years of efforts, cloud providers are able to solve most incidents automatically and rapidly. The secret of this ability is intelligent incident detection. Only when incidents are detected timely, accurately, and comprehensively, can they be diagnosed and mitigated at a satisfiable speed. To overcome the limitations of traditional rule-based detection, we carried out years of incident detection research. We developed a comprehensive AIOps (Artificial Intelligence for IT Operations) framework for incident detection containing a set of data-driven methods. This paper shares our recent experience of developing and deploying such an intelligent incident detection system at Microsoft. We first discuss the real-world challenges of incident detection that constitute the pain points of engineers. Then, we summarize our intelligent solutions proposed in recent years to tackle these challenges. 
Finally, we show the deployment of the incident detection AIOps framework and demonstrate its practical benefits conveyed to Microsoft cloud services with real cases.<\/jats:p>","DOI":"10.1145\/3544497.3544499","type":"journal-article","created":{"date-parts":[[2022,6,15]],"date-time":"2022-06-15T10:06:57Z","timestamp":1655287617000},"page":"1-7","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["An Intelligent Framework for Timely, Accurate, and Comprehensive Cloud Incident Detection"],"prefix":"10.1145","volume":"56","author":[{"given":"Yichen","family":"Li","sequence":"first","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong, China"}]},{"given":"Xu","family":"Zhang","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing 100080, China"}]},{"given":"Shilin","family":"He","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing 100080, China"}]},{"given":"Zhuangbin","family":"Chen","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong, China"}]},{"given":"Yu","family":"Kang","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing 100080, China"}]},{"given":"Jinyang","family":"Liu","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong, China"}]},{"given":"Liqun","family":"Li","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing 100080, China"}]},{"given":"Yingnong","family":"Dang","sequence":"additional","affiliation":[{"name":"Microsoft Azure, Redmond, WA 98052, USA"}]},{"given":"Feng","family":"Gao","sequence":"additional","affiliation":[{"name":"Microsoft Azure, Redmond, WA 98052, USA"}]},{"given":"Zhangwei","family":"Xu","sequence":"additional","affiliation":[{"name":"Microsoft Azure, Redmond, WA 98052, USA"}]},{"given":"Saravan","family":"Rajmohan","sequence":"additional","affiliation":[{"name":"Microsoft 365, 
Redmond, WA 98052, USA"}]},{"given":"Qingwei","family":"Lin","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing 100080, China"}]},{"given":"Dongmei","family":"Zhang","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing 100080, China"}]},{"given":"Michael R.","family":"Lyu","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong, China"}]}],"member":"320","published-online":{"date-parts":[[2022,6,14]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2019. Federal Cloud Computing Strategy. https:\/\/cloud.cio.gov\/."},{"key":"e_1_2_1_2_1","unstructured":"2022. AWS Post-Event Summaries. https:\/\/aws.amazon.com\/cn\/premiumsupport\/technology\/pes\/."},{"key":"e_1_2_1_3_1","unstructured":"2022. Azure status history. https:\/\/status.azure.com\/en-us\/status\/history\/."},{"key":"e_1_2_1_4_1","unstructured":"2022. Google Cloud Status Dashboard. https:\/\/status.cloud.google.com\/summary."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3324884.3416624"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3368089.3409768"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313501"},{"key":"e_1_2_1_8_1","unstructured":"Zhuangbin Chen Yu Kang Feng Gao Li Yang Jeffrey Sun Zhangwei Xu Pu Zhao Bo Qiao Liqun Li Xu Zhang et al. 2020. Aiops innovations of incident management for cloud services. 
(2020)."},{"key":"e_1_2_1_9_1","volume-title":"Lyu","author":"Chen Zhuangbin","year":"2020","unstructured":"Zhuangbin Chen, Yu Kang, Liqun Li, Xu Zhang, Hongyu Zhang, Hui Xu, Yangfan Zhou, Li Yang, Jeffrey Sun, Zhangwei Xu, Yingnong Dang, Feng Gao, Pu Zhao, Bo Qiao, Qingwei Lin, Dongmei Zhang, and Michael R. Lyu. 2020. Towards Intelligent Incident Management: Why We Need It and How We Make It. Association for Computing Machinery, New York, NY, USA, 1487--1497."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3338906.3338916"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE-Companion.2019.00023"},{"key":"e_1_2_1_12_1","volume-title":"Emergent failures: Rethinking cloud reliability at scale","author":"Garraghan Peter","year":"2018","unstructured":"Peter Garraghan, Renyu Yang, Zhenyu Wen, Alexander Romanovsky, Jie Xu, Rajkumar Buyya, and Rajiv Ranjan. 2018. Emergent failures: Rethinking cloud reliability at scale. IEEE Cloud Computing 5, 5 (2018)."},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (FSE). 292--303","author":"Gu Jiazhen","year":"2020","unstructured":"Jiazhen Gu, Chuan Luo, Si Qin, Bo Qiao, Qingwei Lin, Hongyu Zhang, Ze Li, Yingnong Dang, Shaowei Cai, Wei Wu, et al. 2020. 
Efficient incident identification from multi-dimensional issue reports via metaheuristic search. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (FSE). 292--303."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3368089.3417061"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3236024.3236083"},{"key":"e_1_2_1_16_1","volume-title":"Data sovereignty: A review","author":"Hummel Patrik","year":"2021","unstructured":"Patrik Hummel, Matthias Braun, Max Tretter, and Peter Dabrock. 2021. Data sovereignty: A review. Big Data & Society 8, 1 (2021)."},{"key":"e_1_2_1_17_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 1155--1170","author":"Levy Sebastien","year":"2020","unstructured":"Sebastien Levy, Randolph Yao, Youjiang Wu, Yingnong Dang, Peng Huang, Zheng Mu, Pu Zhao, Tarun Ramani, Naga Govindaraju, Xukun Li, Qingwei Lin, Gil Lapid Shafriri, and Murali Chintalapati. 2020. 
Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 1155--1170."},{"key":"e_1_2_1_18_1","volume-title":"Fighting the Fog of War: Automated Incident Detection for Cloud Systems. In 2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Li Liqun","year":"2021","unstructured":"Liqun Li, Xu Zhang, Xin Zhao, Hongyu Zhang, Yu Kang, Pu Zhao, Bo Qiao, Shilin He, Pochian Lee, Jeffrey Sun, et al. 2021. Fighting the Fog of War: Automated Incident Detection for Cloud Systems. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 131--146."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3236024.3236060"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2884781.2884795"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3442381.3449867"},{"key":"e_1_2_1_22_1","unstructured":"Si Qin Yong Xu Shandan Zhou Qingwei Lin Hongyu Zhang Saurabh Agarwal Karthikeyan Subramanian Eli Cortez John Miller Chris Cowdery et al. 2020. Prediction-Guided Design for Software Systems. (2020)."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSRE52982.2021.00017"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE43902.2021.00085"},{"key":"e_1_2_1_25_1","volume-title":"2018 USENIX Annual Technical Conference (USENIX ATC). 
481--494","author":"Xu Yong","year":"2018","unstructured":"Yong Xu, Kaixin Sui, Randolph Yao, Hongyu Zhang, Qingwei Lin, Yingnong Dang, Peng Li, Keceng Jiang, Wenchi Zhang, Jian-Guang Lou, Murali Chintalapati, and Dongmei Zhang. 2018. Improving Service Availability of Cloud Systems by Predicting Disk Error. In 2018 USENIX Annual Technical Conference (USENIX ATC). 481--494."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447548.3467190"},{"key":"e_1_2_1_27_1","volume-title":"2019 USENIX Annual Technical Conference (USENIX ATC). 1063--1076","author":"Zhang Xu","year":"2019","unstructured":"Xu Zhang, Junghyun Kim, Qingwei Lin, Keunhak Lim, Shobhit O Kanaujia, Yong Xu, Kyle Jamieson, Aws Albarghouthi, Si Qin, Michael J Freedman, et al. 2019. Cross-dataset time series anomaly detection for cloud systems. In 2019 USENIX Annual Technical Conference (USENIX ATC). 1063--1076."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3338906.3338931"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3468264.3473919"},{"key":"e_1_2_1_30_1","unstructured":"
Pu Zhao Chuan Luo Bo Qiao Youjiang Wu Yingnong Dang Murali Chintalapati Susy Yi Paul Wang Andrew Zhou Saravanakumar Rajmohan et al. 2021. F3: Fault Forecasting Framework for Cloud Systems. (2021)."}],"container-title":["ACM SIGOPS Operating Systems Review"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3544497.3544499","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3544497.3544499","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:02:54Z","timestamp":1750186974000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3544497.3544499"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,14]]},"references-count":30,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,6,14]]}},"alternative-id":["10.1145\/3544497.3544499"],"URL":"https:\/\/doi.org\/10.1145\/3544497.3544499","relation":{},"ISSN":["0163-5980"],"issn-type":[{"value":"0163-5980","type":"print"}],"subject":[],"published":{"date-parts":[[2022,6,14]]},"assertion":[{"value":"2022-06-14","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}