{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,11]],"date-time":"2026-04-11T13:08:52Z","timestamp":1775912932133,"version":"3.50.1"},"reference-count":147,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2025,12,31]]},"abstract":"<jats:p>Root cause localization is the process of monitoring system behavior and analyzing fault patterns from behavioral data. It is applicable in software development, network operations, and cloud computing. However, with the advent of microservice architectures and cloud-native technologies, root cause localization becomes an arduous task. Frequent updates in systems result in large-scale data and complex dependencies. Traditional analysis methods relying on manual experience and predefined rules have limited data processing and cannot learn new fault patterns from historical knowledge. Artificial Intelligence techniques have emerged as powerful tools to leverage historical knowledge and are now widely used in root cause localization. In this article, we provide a structured overview and a qualitative analysis of root cause localization in microservice systems. To begin with, we review the literature in this area and abstract a workflow of root cause localization, including multimodal data collection, intelligent root cause analysis, and performance evaluation. In particular, we highlight the role played by Artificial Intelligence techniques. Finally, we discuss some open challenges and research directions and propose an end-to-end framework from a new perspective, providing insights for future works.<\/jats:p>","DOI":"10.1145\/3736755","type":"journal-article","created":{"date-parts":[[2025,5,22]],"date-time":"2025-05-22T07:17:51Z","timestamp":1747898271000},"page":"1-37","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Intelligent Root Cause Localization in MicroService Systems: A Survey and New Perspectives"],"prefix":"10.1145","volume":"57","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-1456-817X","authenticated-orcid":false,"given":"Nan","family":"Fu","sequence":"first","affiliation":[{"name":"Southeast University","place":["Nanjing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8642-4362","authenticated-orcid":false,"given":"Guang","family":"Cheng","sequence":"additional","affiliation":[{"name":"Southeast University","place":["Nanjing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-6716-0520","authenticated-orcid":false,"given":"Yue","family":"Teng","sequence":"additional","affiliation":[{"name":"Southeast University","place":["Nanjing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-1582-8200","authenticated-orcid":false,"given":"Guangye","family":"Dai","sequence":"additional","affiliation":[{"name":"Southeast University","place":["Nanjing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4485-6743","authenticated-orcid":false,"given":"Shui","family":"Yu","sequence":"additional","affiliation":[{"name":"University of Technology Sydney","place":["Sydney, Australia"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5871-7766","authenticated-orcid":false,"given":"Zihan","family":"Chen","sequence":"additional","affiliation":[{"name":"Southeast University","place":["Nanjing, China"]}]}],"member":"320","published-online":{"date-parts":[[2025,7,14]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Cloud Native Computing Foundation. 2020. CNCF Cloud Native Interactive Landscape. CNCF Cloud Native Interactive Landscape. Retrieved June 4 2025 from https:\/\/landscape.cncf.io\/"},{"key":"e_1_3_1_3_2","unstructured":"Cloud Native Computing Foundation. 2022. What is continuous profiling? CNCF. Retrieved June 4 2025 from https:\/\/www.cncf.io\/blog\/2022\/05\/31\/what-is-continuous-profiling\/"},{"key":"e_1_3_1_4_2","unstructured":"Nightingale. 2024. Nightingale. Nightingale. Retrieved June 4 2025 from https:\/\/n9e.github.io\/"},{"key":"e_1_3_1_5_2","unstructured":"Netflix. 2022. GitHub - Netflix\/mantis: A platform that makes it easy for developers to build realtime cost-effective operations-focused applications. GitHub. Retrieved June 4 2025 from https:\/\/github.com\/netflix\/mantis"},{"key":"e_1_3_1_6_2","first-page":"137","volume-title":"International Conference on Service-Oriented Computing","author":"Aggarwal Pooja","year":"2020","unstructured":"Pooja Aggarwal, Ajay Gupta, Prateeti Mohapatra, Seema Nagar, Atri Mandal, Qing Wang, and Amit Paradkar. 2020. Localization of operational faults in cloud applications by mining causal dependencies in logs using golden signals. In International Conference on Service-Oriented Computing. Springer, 137\u2013149."},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNET.2017.2761758"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3430984.3431027"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3651890.3672254"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/775047.775109"},{"key":"e_1_3_1_11_2","unstructured":"Azamikram. 2022. GitHub - azamikram\/rcd: Root Cause Discovery: Root Cause Analysis of Failures in Microservices through Causal Discovery. GitHub. Retrieved June 4 2025 from https:\/\/github.com\/azamikram\/rcd"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/1282427.1282383"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.354"},{"key":"e_1_3_1_14_2","article-title":"Building Netflix\u2019s Distributed Tracing Infrastructure","author":"Blog Netflix Technology","year":"2020","unstructured":"Netflix Technology Blog. 2020. Building Netflix\u2019s Distributed Tracing Infrastructure. Retrieved from https:\/\/netflixtechblog.com\/building-netflixs-distributed-tracing-infrastructure-bb856c319304. Accessed Apr. 4, 2025.","journal-title":"https:\/\/netflixtechblog.com\/building-netflixs-distributed-tracing-infrastructure-bb856c319304"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jss.2019.110432"},{"key":"e_1_3_1_16_2","first-page":"1666","volume-title":"International Conference on Artificial Intelligence and Statistics","author":"Budhathoki Kailash","year":"2021","unstructured":"Kailash Budhathoki, Dominik Janzing, Patrick Bloebaum, and Hoiyi Ng. 2021. Why did the distribution change?. In International Conference on Artificial Intelligence and Statistics. PMLR, 1666\u20131674."},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/IWQOS52092.2021.9521318"},{"key":"e_1_3_1_18_2","unstructured":"Chaos-mesh. 2025. GitHub - chaos-mesh\/chaos-mesh: A Chaos Engineering Platform for Kubernetes. GitHub. Retrieved June 4 2025 from https:\/\/github.com\/chaos-mesh\/chaos-mesh"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/COMPSAC51774.2021.00121"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSC.2016.2607739"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM.2014.6848128"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/DSA52907.2021.00018"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.infsof.2022.107083"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3368089.3417055"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3510003.3510085"},{"key":"e_1_3_1_26_2","unstructured":"cncf. 2022. tag-observability\/whitepaper.md at main \u00b7 cncf\/tag-observability. GitHub. Retrieved June 4 2025 from https:\/\/github.com\/cncf\/tag-observability\/blob\/main\/whitepaper.md"},{"key":"e_1_3_1_27_2","unstructured":"Colin Ian King. 2023. stress-ng (stress next generation). GitHub. Retrieved June 4 2025 from https:\/\/github.com\/ColinIanKing\/stress-ng"},{"key":"e_1_3_1_28_2","unstructured":"Wikipedia Contributors. 2019. Occam\u2019s razor. Wikipedia. Retrieved June 4 2025 from https:\/\/en.wikipedia.org\/wiki\/Occam%27s_razor"},{"key":"e_1_3_1_29_2","article-title":"Observability: Principles, Challenges, Capabilities & Practices\u2014Coralogix","year":"2024","unstructured":"Coralogix. 2024. Observability: Principles, Challenges, Capabilities & Practices\u2014Coralogix. Retrieved from https:\/\/coralogix.com\/guides\/observability\/. Accessed Apr. 4, 2025.","journal-title":"https:\/\/coralogix.com\/guides\/observability\/"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/SoSE50414.2020.9130526"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE-Companion.2019.00023"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3329859.3329878"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3611643.3613864"},{"key":"e_1_3_1_34_2","first-page":"601","article-title":"Revelio: Ml-generated debugging queries for finding root causes in distributed systems","volume":"4","author":"Dogga Pradeep","year":"2022","unstructured":"Pradeep Dogga, Karthik Narasimhan, Anirudh Sivaraman, Shiv Saini, George Varghese, and Ravi Netravali. 2022. Revelio: Ml-generated debugging queries for finding root causes in distributed systems. Proceedings of Machine Learning and Systems 4 (2022), 601\u2013622.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3133956.3134015"},{"key":"e_1_3_1_36_2","unstructured":"eBPF. 2025. eBPF - Introduction Tutorials & Community Resources. ebpf.io. Retrieved June 4 2025 from https:\/\/ebpf.io"},{"key":"e_1_3_1_37_2","unstructured":"Elasticsearch B.V. 2025. Logstash: Collect Parse Transform Logs. Elastic. Retrieved June 4 2025 from https:\/\/www.elastic.co\/logstash"},{"key":"e_1_3_1_38_2","unstructured":"The Apache Software Foundation. 2025. Apache SkyWalking. skywalking.apache.org. Retrieved June 4 2025 from https:\/\/skywalking.apache.org\/"},{"key":"e_1_3_1_39_2","unstructured":"FudanSELab. 2019. GitHub - FudanSELab\/Research-ESEC-FSE2019-AIOPS. GitHub. Retrieved June 4 2025 from https:\/\/github.com\/FudanSELab\/Research-ESEC-FSE2019-AIOPS"},{"key":"e_1_3_1_40_2","unstructured":"FudanSELab. 2022. GitHub - FudanSELab\/train-ticket: Train Ticket - A Benchmark Microservice System. GitHub. Retrieved June 4 2025 from https:\/\/github.com\/FudanSELab\/train-ticket"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304013"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304004"},{"key":"e_1_3_1_43_2","article-title":"cadvisor","year":"2024","unstructured":"Google. 2024. cadvisor. GitHub, Retrieved from https:\/\/github.com\/google\/cadvisor. Accessed Apr. 16, 2024.","journal-title":"GitHub, Retrieved from https:\/\/github.com\/google\/cadvisor"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3368089.3409741"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-17642-6_45"},{"issue":"4","key":"e_1_3_1_46_2","first-page":"1","article-title":"A survey of learning causality with data: Problems and methods","volume":"53","author":"Guo Ruocheng","year":"2020","unstructured":"Ruocheng Guo, Lu Cheng, Jundong Li, P Richard Hahn, and Huan Liu. 2020. A survey of learning causality with data: Problems and methods. ACM Computing Surveys (CSUR) 53, 4 (2020), 1\u201337.","journal-title":"ACM Computing Surveys (CSUR)"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3368089.3417066"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3595289"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3603269.3604877"},{"key":"e_1_3_1_50_2","article-title":"Continuous Profiling: A New Observability Signal","author":"Horovits Dotan","year":"2022","unstructured":"Dotan Horovits. 2022. Continuous Profiling: A New Observability Signal. Retrieved from https:\/\/logz.io\/blog\/continuous-profiling-new-observability-signal-in-opentelemetry\/. Accessed Apr. 4, 2025.","journal-title":"https:\/\/logz.io\/blog\/continuous-profiling-new-observability-signal-in-opentelemetry\/"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/3102980.3103005"},{"key":"e_1_3_1_52_2","article-title":"tc(8) - Linux Man Page","author":"Hubert Bert","year":"2023","unstructured":"Bert Hubert. 2023. tc(8) - Linux Man Page. die.net, Retrieved from https:\/\/linux.die.net\/man\/8\/tc. Accessed Apr. 15, 2024.","journal-title":"die.net, Retrieved from https:\/\/linux.die.net\/man\/8\/tc"},{"key":"e_1_3_1_53_2","first-page":"31158","article-title":"Root cause analysis of failures in microservices through causal discovery","volume":"35","author":"Ikram Azam","year":"2022","unstructured":"Azam Ikram, Sarthak Chakraborty, Subrata Mitra, Shiv Saini, Saurabh Bagchi, and Murat Kocaoglu. 2022. Root cause analysis of failures in microservices through causal discovery. Advances in Neural Information Processing Systems 35 (2022), 31158\u201331170.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_54_2","unstructured":"The Istio Authors. 2025. Istio. Istio. Retrieved June 4 2025 from https:\/\/istio.io\/"},{"key":"e_1_3_1_55_2","unstructured":"The Jaeger Authors. 2025. Jaeger: open source distributed tracing platform. www.jaegertracing.io. Retrieved June 4 2025 from https:\/\/www.jaegertracing.io"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/775152.775191"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/1080173.1080178"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/1592568.1592597"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1145\/2494232.2465753"},{"key":"e_1_3_1_60_2","article-title":"Characterization and learning of causal graphs with latent variables from soft interventions","volume":"32","author":"Kocaoglu Murat","year":"2019","unstructured":"Murat Kocaoglu, Amin Jaber, Karthikeyan Shanmugam, and Elias Bareinboim. 2019. Characterization and learning of causal graphs with latent variables from soft interventions. Advances in Neural Information Processing Systems 32 (2019).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_61_2","unstructured":"Grafana Labs. 2019. Grafana - The open platform for analytics and monitoring. Grafana Labs. Retrieved June 4 2025 from https:\/\/grafana.com\/"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10664-021-10063-9"},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1145\/3534678.3539041"},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jss.2023.111748"},{"key":"e_1_3_1_65_2","first-page":"1","volume-title":"2021 IEEE\/ACM 29th International Symposium on Quality of Service (IWQOS\u201921)","year":"2021","unstructured":"Zeyan Li, Junjie Chen, Rui Jiao, Nengwen Zhao, Zhijun Wang, Shuwei Zhang, Yanjun Wu, Long Jiang, Leiqin Yan, Zikai Wang, Zhekang Chen, Wenchi Zhang, Xiaohui Nie, Kaixin Sui, and Dan Pei. 2021. Practical root cause localization for microservice systems via trace analysis. In 2021 IEEE\/ACM 29th International Symposium on Quality of Service (IWQOS\u201921). IEEE, 1\u201310."},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1145\/3540250.3549092"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1145\/3379484"},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-03596-9_1"},{"key":"e_1_3_1_69_2","doi-asserted-by":"publisher","DOI":"10.1145\/2889160.2889232"},{"key":"e_1_3_1_70_2","doi-asserted-by":"publisher","DOI":"10.1109\/PCCC.2018.8711092"},{"key":"e_1_3_1_71_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE-SEIP52600.2021.00043"},{"key":"e_1_3_1_72_2","doi-asserted-by":"publisher","DOI":"10.1145\/2815675.2815679"},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISSRE.2019.00014"},{"key":"e_1_3_1_74_2","first-page":"48","volume-title":"2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE\u201920)","year":"2020","unstructured":"Ping Liu, Haowen Xu, Qianyu Ouyang, Rui Jiao, Zhekang Chen, Shenglin Zhang, Jiahai Yang, Linlin Mo, Jice Zeng, Wenman Xue, and Dan Pei. 2020. Unsupervised detection of microservice trace anomalies through service-level deep bayesian networks. In 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE\u201920). IEEE, 48\u201358."},{"key":"e_1_3_1_75_2","doi-asserted-by":"publisher","DOI":"10.1109\/VAST.2010.5652910"},{"key":"e_1_3_1_76_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPADS47876.2019.00016"},{"key":"e_1_3_1_77_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472883.3487003"},{"key":"e_1_3_1_78_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICWS.2019.00022"},{"key":"e_1_3_1_79_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSC.2020.2993251"},{"key":"e_1_3_1_80_2","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2021.3083671"},{"key":"e_1_3_1_81_2","doi-asserted-by":"publisher","DOI":"10.1145\/3366423.3380111"},{"key":"e_1_3_1_82_2","volume-title":"Observability Engineering","author":"Majors Charity","year":"2022","unstructured":"Charity Majors, Liz Fong-Jones, and George Miranda. 2022. Observability Engineering. \u201cO\u2019Reilly Media, Inc.\u201d."},{"key":"e_1_3_1_83_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICST.2018.00034"},{"key":"e_1_3_1_84_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jss.2019.110464"},{"key":"e_1_3_1_85_2","doi-asserted-by":"publisher","DOI":"10.1109\/CLOUD55607.2022.00026"},{"key":"e_1_3_1_86_2","doi-asserted-by":"publisher","DOI":"10.1109\/IWQoS49365.2020.9213058"},{"key":"e_1_3_1_87_2","doi-asserted-by":"publisher","DOI":"10.1109\/tpds.2013.21"},{"key":"e_1_3_1_88_2","unstructured":"Microservices Demo Authors. 2021. microservices-demo\/microservices-demo. GitHub. Retrieved June 4 2025 from https:\/\/github.com\/microservices-demo\/microservices-demo"},{"key":"e_1_3_1_89_2","doi-asserted-by":"publisher","DOI":"10.1109\/STC55697.2022.00033"},{"key":"e_1_3_1_90_2","unstructured":"Netflix. 2024. GitHub - Netflix\/atlas: In-memory dimensional time series database. GitHub. Retrieved June 4 2025 from https:\/\/github.com\/Netflix\/atlas"},{"key":"e_1_3_1_91_2","unstructured":"Netflix. 2025. GitHub - Netflix\/spectator: Client library for collecting metrics. GitHub. Retrieved June 4 2025 from https:\/\/github.com\/Netflix\/spectator"},{"key":"e_1_3_1_92_2","unstructured":"Spring Cloud Netflix. 2025. 10. Metrics: Spectator Servo and Atlas. Spring.io. Retrieved June 4 2025 from https:\/\/cloud.spring.io\/spring-cloud-netflix\/multi\/multi_netflix-metrics.html"},{"key":"e_1_3_1_93_2","unstructured":"NetManAIOps. 2022. GitHub - NetManAIOps\/DejaVu: Code and datasets for FSE.22 paper \u201cActionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems.\u201d GitHub. Retrieved June 4 2025 from https:\/\/github.com\/NetManAIOps\/DejaVu"},{"key":"e_1_3_1_94_2","unstructured":"NetManAIOps. 2022. GitHub - NetManAIOps\/PSqueeze. GitHub. Retrieved June 4 2025 from https:\/\/github.com\/NetManAIOps\/PSqueeze"},{"key":"e_1_3_1_95_2","unstructured":"NetManAIOps. 2021. GitHub - NetManAIOps\/TraceRCA: Practical Root Cause Localization for Microservice Systems via Trace Analysis. IWQoS 2021. GitHub. Retrieved June 4 2025 from https:\/\/github.com\/NetManAIOps\/TraceRCA"},{"key":"e_1_3_1_96_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCS.2013.26"},{"key":"e_1_3_1_97_2","doi-asserted-by":"publisher","DOI":"10.1145\/2038633.2038634"},{"key":"e_1_3_1_98_2","doi-asserted-by":"publisher","DOI":"10.1145\/3483424"},{"key":"e_1_3_1_99_2","unstructured":"OpenTelemetry Authors. 2025. What is OpenTelemetry? OpenTelemetry. Retrieved June 4 2025 from https:\/\/opentelemetry.io\/docs\/what-is-opentelemetry\/"},{"key":"e_1_3_1_100_2","first-page":"27730","article-title":"Training language models to follow instructions with human feedback","volume":"35","year":"2022","unstructured":"Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730\u201327744.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_101_2","doi-asserted-by":"publisher","DOI":"10.1145\/3460319.3464805"},{"key":"e_1_3_1_102_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2005.159"},{"issue":"2","key":"e_1_3_1_103_2","first-page":"503","article-title":"Failure diagnosis for distributed systems using targeted fault injection","volume":"28","author":"Pham Cuong","year":"2016","unstructured":"Cuong Pham, Long Wang, Byung Chul Tak, Salman Baset, Chunqiang Tang, Zbigniew Kalbarczyk, and Ravishankar K Iyer. 2016. Failure diagnosis for distributed systems using targeted fault injection. IEEE Transactions on Parallel and Distributed Systems 28, 2 (2016), 503\u2013516.","journal-title":"IEEE Transactions on Parallel and Distributed Systems"},{"key":"e_1_3_1_104_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-10-8603-8_1"},{"key":"e_1_3_1_105_2","unstructured":"Fluentd Project. 2025. What is Fluentd? | Fluentd. www.fluentd.org. Retrieved June 4 2025 from https:\/\/www.fluentd.org\/architecture"},{"key":"e_1_3_1_106_2","unstructured":"The Prometheus Authors. 2021. prometheus\/node_exporter. GitHub. Retrieved June 4 2025 from https:\/\/github.com\/prometheus\/node_exporte"},{"key":"e_1_3_1_107_2","unstructured":"The Prometheus Authors. 2025. Prometheus - Monitoring system & time series database. prometheus.io. Retrieved June 4 2025 from https:\/\/prometheus.io"},{"key":"e_1_3_1_108_2","unstructured":"Pyroscope. 2022. Open Source Continuous Profiling Platform. Pyroscope.io. Retrieved June 4 2025 from https:\/\/pyroscope.io\/"},{"key":"e_1_3_1_109_2","doi-asserted-by":"publisher","DOI":"10.3390\/app10062166"},{"key":"e_1_3_1_110_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2010.68"},{"key":"e_1_3_1_111_2","doi-asserted-by":"publisher","DOI":"10.1109\/IWQoS57198.2023.10188728"},{"key":"e_1_3_1_112_2","doi-asserted-by":"publisher","DOI":"10.1109\/FiCloud.2019.00036"},{"key":"e_1_3_1_113_2","article-title":"The 3 Pillars of System Observability: Logs, Metrics, and Tracing","year":"2020","unstructured":"Samuel. 2020. The 3 Pillars of System Observability: Logs, Metrics, and Tracing. Retrieved from https:\/\/iamondemand.com\/blog\/the-3-pillars-of-system-observability-logs-metrics-and-tracing\/. Accessed Apr. 4, 2025.","journal-title":"https:\/\/iamondemand.com\/blog\/the-3-pillars-of-system-observability-logs-metrics-and-tracing\/"},{"key":"e_1_3_1_114_2","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313653"},{"key":"e_1_3_1_115_2","doi-asserted-by":"publisher","DOI":"10.1145\/3603269.3604823"},{"key":"e_1_3_1_116_2","unstructured":"Benjamin H. Sigelman Luiz Andr\u00e9 Barroso Mike Burrows Pat Stephenson Manoj Plakal Donald Beaver Saul Jaspan and Chandan Shanbhag. 2010. Dapper a large-scale distributed systems tracing infrastructure. (2010)."},{"key":"e_1_3_1_117_2","unstructured":"SolarWinds Worldwide LLC. 2025. What is Observability?\u2014IT Glossary | SolarWinds. www.solarwinds.com. Retrieved June 4 2025 from https:\/\/www.solarwinds.com\/resources\/it-glossary\/observability"},{"key":"e_1_3_1_118_2","unstructured":"SolarWinds Worldwide LLC. 2025. Log Monitoring | Loggly. Log Analysis | Log Monitoring by Loggly. Retrieved June 4 2025 from https:\/\/www.loggly.com\/product\/log-monitoring\/"},{"key":"e_1_3_1_119_2","doi-asserted-by":"publisher","DOI":"10.1145\/3501297"},{"key":"e_1_3_1_120_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jss.2018.09.082"},{"key":"e_1_3_1_121_2","unstructured":"Marc Sol\u00e9 Victor Munt\u00e9s-Mulero Annie Ibrahim Rana and Giovani Estrada. 2017. Survey on models and techniques for root-cause analysis. arXiv:1701.08546. Retrieved from https:\/\/arxiv.org\/abs\/1701.08546"},{"key":"e_1_3_1_122_2","volume-title":"Causation, Prediction, and Search","author":"Spirtes Peter","year":"2000","unstructured":"Peter Spirtes, Clark N. Glymour, and Richard Scheines. 2000. Causation, Prediction, and Search. MIT Press."},{"key":"e_1_3_1_123_2","unstructured":"Splunk LLC. 2025. Splunk Products. Splunk. Retrieved June 4 2025 from https:\/\/www.splunk.com\/en_us\/products.html"},{"key":"e_1_3_1_124_2","unstructured":"The strace developers. 2024. strace\/strace. GitHub. Retrieved June 4 2025 from https:\/\/github.com\/strace\/strace"},{"key":"e_1_3_1_125_2","unstructured":"Netflix. 2022. Cassandra. Netflix TechBlog. Retrieved June 4 2025 from https:\/\/netflixtechblog.com\/tagged\/cassandra"},{"key":"e_1_3_1_126_2","doi-asserted-by":"publisher","DOI":"10.1145\/3135974.3135977"},{"key":"e_1_3_1_127_2","first-page":"37","article-title":"GigaOm radar for cloud observability","author":"Thurai A.","year":"2021","unstructured":"A. Thurai and S. D. Linthicum. 2021. GigaOm radar for cloud observability. GigaOm, Santa Barbara, CA, USA, Tech. Rep (2021), 37.","journal-title":"GigaOm, Santa Barbara, CA, USA, Tech. Rep"},{"key":"e_1_3_1_128_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2022.3193102"},{"key":"e_1_3_1_129_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASE51524.2021.9678708"},{"key":"e_1_3_1_130_2","doi-asserted-by":"publisher","DOI":"10.1109\/IWQoS.2015.7404741"},{"key":"e_1_3_1_131_2","doi-asserted-by":"publisher","DOI":"10.1145\/3580305.3599934"},{"key":"e_1_3_1_132_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICWS49710.2020.00026"},{"key":"e_1_3_1_133_2","doi-asserted-by":"publisher","DOI":"10.1109\/CCGRID.2018.00076"},{"key":"e_1_3_1_134_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNET.2018.2843805"},{"key":"e_1_3_1_135_2","first-page":"85","volume-title":"International Conference on Service-Oriented Computing","author":"Wu Li","year":"2020","unstructured":"Li Wu, Jasmin Bogatinovski, Sasho Nedelkoski, Johan Tordsson, and Odej Kao. 2020. Performance diagnosis in cloud microservices using deep learning. In International Conference on Service-Oriented Computing. Springer, 85\u201396."},{"key":"e_1_3_1_136_2","doi-asserted-by":"publisher","DOI":"10.1109\/CloudIntelligence52565.2021.00015"},{"key":"e_1_3_1_137_2","doi-asserted-by":"publisher","DOI":"10.1109\/NOMS47738.2020.9110353"},{"key":"e_1_3_1_138_2","doi-asserted-by":"publisher","DOI":"10.23919\/APNOMS56106.2022.9919941"},{"key":"e_1_3_1_139_2","first-page":"416","volume-title":"2021 IEEE\/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid\u201921)","author":"Ye Zihao","year":"2021","unstructured":"Zihao Ye, Pengfei Chen, and Guangba Yu. 2021. T-rank: A lightweight spectrum based fault localization approach for microservice systems. In 2021 IEEE\/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid\u201921). IEEE, 416\u2013425."},{"key":"e_1_3_1_140_2","doi-asserted-by":"publisher","DOI":"10.1145\/3442381.3449905"},{"key":"e_1_3_1_141_2","unstructured":"Zabbix LLC. 2018. Zabbix - The Enterprise-Class Open Source Network Monitoring Solution. Zabbix.com. Retrieved June 4 2025 from https:\/\/www.zabbix.com"},{"key":"e_1_3_1_142_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSA.2019.00014"},{"key":"e_1_3_1_143_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCCBDA55098.2022.9778893"},{"key":"e_1_3_1_144_2","doi-asserted-by":"publisher","DOI":"10.1145\/3459637.3481903"},{"key":"e_1_3_1_145_2","doi-asserted-by":"publisher","DOI":"10.18293\/SEKE2021-091"},{"key":"e_1_3_1_146_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2018.2887384"},{"key":"e_1_3_1_147_2","doi-asserted-by":"publisher","DOI":"10.1145\/3338906.3338961"},{"key":"e_1_3_1_148_2","unstructured":"The Zipkin Authors. 2025. OpenZipkin \u00b7 A distributed tracing system. Zipkin.io. Retrieved June 4 2025 from https:\/\/zipkin.io"}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3736755","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,14]],"date-time":"2025-07-14T13:42:56Z","timestamp":1752500576000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3736755"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,14]]},"references-count":147,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,12,31]]}},"alternative-id":["10.1145\/3736755"],"URL":"https:\/\/doi.org\/10.1145\/3736755","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,14]]},"assertion":[{"value":"2024-04-29","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-04-29","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-14","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}