{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,23]],"date-time":"2025-08-23T00:40:08Z","timestamp":1755909608016,"version":"3.44.0"},"reference-count":50,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2023,12,7]],"date-time":"2023-12-07T00:00:00Z","timestamp":1701907200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100006374","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62072006,92167104"],"award-info":[{"award-number":["62072006,92167104"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Qiyuan Lab Innovation Fund","award":["B0211"],"award-info":[{"award-number":["B0211"]}]},{"name":"ByteDance University Research Project"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Meas. Anal. Comput. Syst."],"published-print":{"date-parts":[[2023,12,7]]},"abstract":"<jats:p>This study demonstrates the salient facts and challenges of host failure operations in hyperscale data centers. A host incident can involve hundreds of distinct host-level metrics, covering broad aspects. The faulting mechanism inside the host connects these heterogeneous metrics through direct and indirect correlation, making it extremely difficult to sort out the propagation procedures and the root cause from these intertwined indicators. To deeply understand the failure mechanism inside the host, we develop HEAL -- a novel host metrics analysis toolkit. HEAL synergistically discovers dynamic causality in sparse heterogeneous host metrics by combining the strengths of both time series and random variable analysis. It can also proactively extract causal directional hints from causality's asymmetry and historical knowledge. 
Together, these breakthroughs help HEAL produce accurate results given undesirable inputs. Extensive experiments in our production environment verify that HEAL provides significantly better result accuracy and full-process interpretability than the SOTA baselines. With these advantages, HEAL successfully serves our data center and worldwide product operations and impressively contributes to many other workflows.<\/jats:p>","DOI":"10.1145\/3626785","type":"journal-article","created":{"date-parts":[[2023,12,12]],"date-time":"2023-12-12T15:20:29Z","timestamp":1702394429000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["HEAL: Performance Troubleshooting Deep inside Data Center Hosts"],"prefix":"10.1145","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4139-1477","authenticated-orcid":false,"given":"Yicheng","family":"Pan","sequence":"first","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-3965-2949","authenticated-orcid":false,"given":"Yang","family":"Zhang","sequence":"additional","affiliation":[{"name":"ByteDance Inc., Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0366-0410","authenticated-orcid":false,"given":"Tingzhu","family":"Bi","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4714-2634","authenticated-orcid":false,"given":"Linlin","family":"Han","sequence":"additional","affiliation":[{"name":"ByteDance Inc., Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4334-5159","authenticated-orcid":false,"given":"Yu","family":"Zhang","sequence":"additional","affiliation":[{"name":"ByteDance Inc., Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1963-2513","authenticated-orcid":false,"given":"Meng","family":"Ma","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-9321-4442","authenticated-orcid":false,"given":"Xiangzhuang","family":"Shen","sequence":"additional","affiliation":[{"name":"ByteDance Inc., Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1591-0480","authenticated-orcid":false,"given":"Xinrui","family":"Jiang","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-3981-8163","authenticated-orcid":false,"given":"Feng","family":"Wang","sequence":"additional","affiliation":[{"name":"ByteDance Inc., Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7332-0449","authenticated-orcid":false,"given":"Xian","family":"Liu","sequence":"additional","affiliation":[{"name":"ByteDance Inc., Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8854-2079","authenticated-orcid":false,"given":"Ping","family":"Wang","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2023,12,12]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"crossref","unstructured":"Nuha Alshuqayran Nour Ali and Roger Evans. 2016. A Systematic Mapping Study in Microservice Architecture. In SOCA. 44--51.","DOI":"10.1109\/SOCA.2016.15"},{"key":"e_1_2_1_2_1","unstructured":"Dan Ardelean Amer Diwan and Chandra Erdman. 2018. Performance Analysis of Cloud Applications. In NSDI. 405--417."},{"volume-title":"atop. https:\/\/github.com\/Atoptool\/atop (Accessed","year":"2023","key":"e_1_2_1_3_1","unstructured":"atop. 2023. atop. https:\/\/github.com\/Atoptool\/atop (Accessed: July 18, 2023)."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/MS.2016.64"},{"key":"e_1_2_1_5_1","doi-asserted-by":"crossref","unstructured":"Ali Basiri Lorin Hochstein Nora Jones and Haley Tucker. 2019. Automating chaos experiments in production. In ICSE (SEIP). 
31--40.","DOI":"10.1109\/ICSE-SEIP.2019.00012"},{"key":"e_1_2_1_6_1","doi-asserted-by":"crossref","unstructured":"Pengfei Chen Yong Qi Pengfei Zheng and Di Hou. 2014. CauseInfer: Automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems. In INFOCOM. 1887--1895.","DOI":"10.1109\/INFOCOM.2014.6848128"},{"key":"e_1_2_1_7_1","volume-title":"Terence Kelly, and Julie Symons.","author":"Cohen Ira","year":"2004","unstructured":"Ira Cohen, Jeffrey S. Chase, Mois\u00e9s Goldszmidt, Terence Kelly, and Julie Symons. 2004. Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control. In OSDI. 231--244."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1002\/spe.4380211102"},{"key":"e_1_2_1_9_1","doi-asserted-by":"crossref","unstructured":"Yu Gan Mingyu Liang Sundar Dev David Lo and Christina Delimitrou. 2021. Sage: practical and scalable ML-driven performance debugging in microservices. In ASPLOS. 135--151.","DOI":"10.1145\/3445814.3446700"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304004"},{"key":"e_1_2_1_11_1","volume-title":"Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society","author":"Granger Clive WJ","year":"1969","unstructured":"Clive WJ Granger. 1969. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society (1969), 424--438."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1002\/for.3980030207"},{"key":"e_1_2_1_13_1","doi-asserted-by":"crossref","unstructured":"Mark Grechanik Chen Fu and Qing Xie. 2012. Automatically finding performance problems with feedback-directed learning software testing. In ICSE. 
156--166.","DOI":"10.1109\/ICSE.2012.6227197"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2023.3241299"},{"key":"e_1_2_1_15_1","volume-title":"Eliazar","author":"Gunawi Haryadi S.","year":"2016","unstructured":"Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffry Adityatama, and Kurnia J. Eliazar. 2016. Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages. In SoCC. 1--16."},{"volume-title":"hwmon. https:\/\/docs.kernel.org\/hwmon (Accessed","year":"2023","key":"e_1_2_1_16_1","unstructured":"hwmon. 2023. hwmon. https:\/\/docs.kernel.org\/hwmon (Accessed: July 18, 2023)."},{"key":"e_1_2_1_17_1","volume-title":"JMLR Workshop Conf Proc","volume":"52","author":"Hyttinen Antti","year":"2016","unstructured":"Antti Hyttinen, Sergey M. Plis, Matti J\u00e4rvisalo, Frederick Eberhardt, and David Danks. 2016. Causal Discovery from Subsampled Time Series Data by Constraint Optimization. In JMLR Workshop Conf Proc, Vol. 52. 216--227."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3314048"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1592568.1592597"},{"key":"e_1_2_1_20_1","doi-asserted-by":"crossref","unstructured":"Myunghwan Kim Roshan Sumbaly and Sam Shah. 2013. Root cause detection in a service-oriented architecture. In SIGMETRICS. 93--104.","DOI":"10.1145\/2465529.2465753"},{"key":"e_1_2_1_21_1","volume-title":"Estimating mutual information. Physical review E","author":"Kraskov Alexander","year":"2004","unstructured":"Alexander Kraskov, Harald St\u00f6gbauer, and Peter Grassberger. 2004. Estimating mutual information. Physical review E, Vol. 69, 6 (2004), 066138."},{"key":"e_1_2_1_22_1","doi-asserted-by":"crossref","unstructured":"Jaewon Lee Changkyu Kim Kun Lin Liqun Cheng Rama Govindaraju and Jangwoo Kim. 2018. WSMeter: A Performance Evaluation Methodology for Google's Production Warehouse-Scale Computers. In ASPLOS. 
549--563.","DOI":"10.1145\/3173162.3173196"},{"key":"e_1_2_1_23_1","doi-asserted-by":"crossref","unstructured":"Mingjie Li Zeyan Li Kanglin Yin Xiaohui Nie Wenchi Zhang Kaixin Sui and Dan Pei. 2022. Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition. In KDD. 3230--3240.","DOI":"10.1145\/3534678.3539041"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-03596-9_1"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSC.2020.2993251"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2021.3083671"},{"key":"e_1_2_1_27_1","doi-asserted-by":"crossref","unstructured":"Meng Ma Jingmin Xu Yuan Wang Pengfei Chen Zonghua Zhang and Ping Wang. 2020. AutoMAP: Diagnose Your Microservice-based Web Applications Automatically. In WWW. 246--258.","DOI":"10.1145\/3366423.3380111"},{"key":"e_1_2_1_28_1","doi-asserted-by":"crossref","unstructured":"Weibin Meng Ying Liu Yichen Zhu Shenglin Zhang Dan Pei Yuqing Liu Yihao Chen Ruizhi Zhang Shimin Tao Pei Sun and Rong Zhou. 2019. LogAnomaly: Unsupervised Detection of Sequential and Quantitative Anomalies in Unstructured Logs. In IJCAI. 4739--4745.","DOI":"10.24963\/ijcai.2019\/658"},{"volume-title":"nux. https:\/\/github.com\/toolkits\/nux (Accessed","year":"2023","key":"e_1_2_1_29_1","unstructured":"nux. 2023. nux. https:\/\/github.com\/toolkits\/nux (Accessed: July 18, 2023)."},{"volume-title":"http:\/\/open-falcon.org (Accessed","year":"2023","key":"e_1_2_1_30_1","unstructured":"Open-Falcon. 2023. Open-Falcon. http:\/\/open-falcon.org (Accessed: July 18, 2023)."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/1710115.1710118"},{"key":"e_1_2_1_32_1","doi-asserted-by":"crossref","unstructured":"Yicheng Pan Meng Ma Xinrui Jiang and Ping Wang. 2021. Faster deeper easier: crowdsourcing diagnosis of microservice kernel failure from user space. In ISSTA. 
646--657.","DOI":"10.1145\/3460319.3464805"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403306"},{"volume-title":"https:\/\/prometheus.io (Accessed","year":"2023","key":"e_1_2_1_34_1","unstructured":"Prometheus. 2023. Prometheus. https:\/\/prometheus.io (Accessed: July 18, 2023)."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1063\/1.5025050"},{"key":"e_1_2_1_36_1","volume-title":"AISTATS (Proceedings of Machine Learning Research","volume":"84","author":"Runge Jakob","year":"2018","unstructured":"Jakob Runge. 2018b. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In AISTATS (Proceedings of Machine Learning Research, Vol. 84). 938--947."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.25080\/Majora-92bf1922-011"},{"key":"e_1_2_1_38_1","volume-title":"Annie Ibrahim Rana, and Giovani Estrada","author":"Sol\u00e9 Marc","year":"2017","unstructured":"Marc Sol\u00e9, Victor Munt\u00e9s-Mulero, Annie Ibrahim Rana, and Giovani Estrada. 2017. Survey on models and techniques for root-cause analysis. arXiv preprint arXiv:1701.08546 (2017)."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1177\/089443939100900106"},{"key":"e_1_2_1_40_1","volume-title":"Kahuna: Problem diagnosis for Mapreduce-based cloud computing environments. In NOMS. 112--119.","author":"Tan Jiaqi","year":"2010","unstructured":"Jiaqi Tan, Xinghao Pan, Eugene Marinelli, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan. 2010. Kahuna: Problem diagnosis for Mapreduce-based cloud computing environments. In NOMS. 112--119."},{"key":"e_1_2_1_41_1","volume-title":"PREPARE: Predictive Performance Anomaly Prevention for Virtualized Cloud Systems. In ICDCS. 285--294.","author":"Tan Yongmin","year":"2012","unstructured":"Yongmin Tan, Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Chitra Venkatramani, and Deepak Rajan. 2012. 
PREPARE: Predictive Performance Anomaly Prevention for Virtualized Cloud Systems. In ICDCS. 285--294."},{"key":"e_1_2_1_42_1","volume-title":"Pramod Bhatotia, Ruichuan Chen, Bimal Viswanath, Lei Jiao, and Christof Fetzer.","author":"Thalheim J\u00f6","year":"2017","unstructured":"J\u00f6rg Thalheim, Antonio Rodrigues, Istemi Ekin Akkus, Pramod Bhatotia, Ruichuan Chen, Bimal Viswanath, Lei Jiao, and Christof Fetzer. 2017. Sieve: actionable insights from monitored metrics in distributed systems. In Middleware. 14--27."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2553070.2553079"},{"key":"e_1_2_1_44_1","doi-asserted-by":"crossref","unstructured":"Chengwei Wang Krishnamurthy Viswanathan Choudur Lakshminarayan Vanish Talwar Wade Satterfield and Karsten Schwan. 2011. Statistical techniques for online anomaly detection in data centers. In Integrated Network Management. 385--392.","DOI":"10.1109\/INM.2011.5990537"},{"key":"e_1_2_1_45_1","doi-asserted-by":"crossref","unstructured":"Ping Wang Jingmin Xu Meng Ma Weilan Lin Disheng Pan Yuan Wang and Pengfei Chen. 2018. CloudRanger: Root Cause Identification for Cloud Native Systems. In CCGrid. 492--502.","DOI":"10.1109\/CCGRID.2018.00076"},{"key":"e_1_2_1_46_1","doi-asserted-by":"crossref","unstructured":"Kejiang Ye. 2017. Anomaly Detection in Clouds: Challenges and Practice. In ETCD@ASPLOS. 6:1--6:2.","DOI":"10.1145\/3129457.3129497"},{"key":"e_1_2_1_47_1","volume-title":"HALO: Hierarchy-aware Fault Localization for Cloud Systems. In KDD. 3948--3958.","author":"Zhang Xu","year":"2021","unstructured":"Xu Zhang, Chao Du, Yifan Li, Yong Xu, Hongyu Zhang, Si Qin, Ze Li, Qingwei Lin, Yingnong Dang, Andrew Zhou, Saravanakumar Rajmohan, and Dongmei Zhang. 2021. HALO: Hierarchy-aware Fault Localization for Cloud Systems. In KDD. 
3948--3958."},{"key":"e_1_2_1_48_1","doi-asserted-by":"crossref","unstructured":"Nengwen Zhao Honglin Wang Zeyan Li Xiao Peng Gang Wang Zhu Pan Yong Wu Zhen Feng Xidao Wen Wenchi Zhang Kaixin Sui and Dan Pei. 2021. An empirical investigation of practical log anomaly detection for online service systems. In ESEC\/SIGSOFT FSE. 1404--1415.","DOI":"10.1145\/3468264.3473933"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2018.2887384"},{"key":"e_1_2_1_50_1","doi-asserted-by":"crossref","unstructured":"Xiang Zhou Xin Peng Tao Xie Jun Sun Chao Ji Dewei Liu Qilin Xiang and Chuan He. 2019. Latent error prediction and fault localization for microservice applications by learning from system trace logs. In ESEC\/SIGSOFT FSE. 683--694.","DOI":"10.1145\/3338906.3338961"}],"container-title":["Proceedings of the ACM on Measurement and Analysis of Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3626785","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3626785","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,23]],"date-time":"2025-08-23T00:15:35Z","timestamp":1755908135000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3626785"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,7]]},"references-count":50,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,12,7]]}},"alternative-id":["10.1145\/3626785"],"URL":"https:\/\/doi.org\/10.1145\/3626785","relation":{},"ISSN":["2476-1249"],"issn-type":[{"type":"electronic","value":"2476-1249"}],"subject":[],"published":{"date-parts":[[2023,12,7]]},"assertion":[{"value":"2023-12-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}