{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T02:07:29Z","timestamp":1776996449461,"version":"3.51.4"},"reference-count":34,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2020,6,9]],"date-time":"2020-06-09T00:00:00Z","timestamp":1591660800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Meas. Anal. Comput. Syst."],"published-print":{"date-parts":[[2020,6,9]]},"abstract":"<jats:p>Root cause analysis in a large-scale production environment is challenging due to the complexity of the services running across global data centers. Due to the distributed nature of a large-scale system, the various hardware, software, and tooling logs are often maintained separately, making it difficult to review the logs jointly for understanding production issues. Another challenge in reviewing the logs for identifying issues is the scale - there could easily be millions of entities, each described by hundreds of features. In this paper we present a fast dimensional analysis framework that automates the root cause analysis on structured logs with improved scalability. We first explore item-sets, i.e. combinations of feature values, that could identify groups of samples with sufficient support for the target failures using the Apriori algorithm and a subsequent improvement, FP-Growth. These algorithms were designed for frequent item-set mining and association rule learning over transactional databases. After applying them on structured logs, we select the item-sets that are most unique to the target failures based on lift. 
We propose pre-processing steps with the use of a large-scale real-time database and post-processing techniques and parallelism to further speed up the analysis and improve interpretability, and demonstrate that such optimization is necessary for handling large-scale production datasets. We have successfully rolled out this approach for root cause investigation purposes within Facebook's infrastructure. We also present the setup and results from multiple production use cases in this paper.<\/jats:p>","DOI":"10.1145\/3392149","type":"journal-article","created":{"date-parts":[[2020,6,9]],"date-time":"2020-06-09T22:10:12Z","timestamp":1591740612000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":28,"title":["Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment"],"prefix":"10.1145","volume":"4","author":[{"given":"Fred","family":"Lin","sequence":"first","affiliation":[{"name":"Facebook Inc., Menlo Park, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Keyur","family":"Muzumdar","sequence":"additional","affiliation":[{"name":"Facebook Inc., Menlo Park, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nikolay Pavlovich","family":"Laptev","sequence":"additional","affiliation":[{"name":"Facebook Inc., Menlo Park, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mihai-Valentin","family":"Curelea","sequence":"additional","affiliation":[{"name":"Facebook Inc., Dublin, Ireland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Seunghak","family":"Lee","sequence":"additional","affiliation":[{"name":"Facebook Inc., Menlo Park, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sriram","family":"Sankar","sequence":"additional","affiliation":[{"name":"Facebook Inc., Menlo Park, CA, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2020,6,12]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536222.2536231"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/170035.170072"},{"key":"e_1_2_1_3_1","volume-title":"Enhancing Spam Detection on Mobile Phone Short Message Service (SMS) Performance Using FP-Growth and Naive Bayes Classifier. In IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob) .","author":"Arifin Dea Delvia","year":"2016"},{"key":"e_1_2_1_4_1","volume":"199","author":"Bay Stephen D.","journal-title":"Michael J. Pazzani."},{"key":"e_1_2_1_5_1","volume-title":"Pazzani","author":"Bay Stephen D.","year":"2001"},{"key":"e_1_2_1_6_1","unstructured":"Ran M. Bittmann Philippe Nemery Xingtian Shi Michael Kemelmakher and Mengjiao Wang. 2018. Frequent Item-set Mining without Ubiquitous Items. In arXiv:1803.11105 [cs.DS] ."},{"key":"e_1_2_1_7_1","volume-title":"Latent Dirichlet Allocation. Journal of machine Learning research","author":"Blei David M","year":"2003"},{"key":"e_1_2_1_8_1","unstructured":"Dhruba Borthakur. 2019. HDFS Architecture Guide. https:\/\/hadoop.apache.org\/docs\/r1.2.1\/hdfs_design.html"},{"key":"e_1_2_1_9_1","volume-title":"Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing. In USENIX Symposium on Operating Systems Design and Implementation .","author":"Boutin Eric","year":"2014"},{"key":"e_1_2_1_10_1","volume-title":"Dynamic Itemset Counting and Implication Rules for Market Basket Data. 
In ACM SIGMOD International Conference on Management of Data .","author":"Brin Sergey","year":"1997"},{"key":"e_1_2_1_11_1","volume-title":"Automatically Analyzing Groups of Crashes for Finding Correlations. In ESEC\/FSE Joint Meeting on Foundations of Software Engineering .","author":"Castelluccio Marco","year":"2017"},{"key":"e_1_2_1_12_1","doi-asserted-by":"crossref","unstructured":"Albert Greenberg James Hamilton David A. Maltz and Parveen Patel. 2009. The Cost of a Cloud: Research Problems in Data Center Networks. In ACM SIGCOMM Computer Communication Review .","DOI":"10.1145\/1496091.1496103"},{"key":"e_1_2_1_13_1","volume-title":"Mining Frequent Patterns Without Candidate Generation. In ACM SIGMOD International Conference on Management of Data .","author":"Han Jiawei","year":"2000"},{"key":"e_1_2_1_14_1","volume-title":"Digital Design and Computer Architecture","author":"Harris David"},{"key":"e_1_2_1_15_1","volume-title":"Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In USENIX conference on Networked systems design and implementation .","author":"Hindman Benjamin","year":"2011"},{"key":"e_1_2_1_16_1","volume-title":"Autopilot: Automatic Data Center Management. In ACM SIGOPS Operating System Review .","author":"Isard Michael","year":"2007"},{"key":"e_1_2_1_17_1","volume-title":"Complexity Analysis of Depth First and FP-growth Implementations of APRIORI. In International Conference on Machine Learning and Data Mining in Pattern Recognition .","author":"Kosters Walter A.","year":"2003"},{"key":"e_1_2_1_18_1","volume-title":"Hardware Remediation At Scale. 
In IEEE\/IFIP International Conference on Dependable Systems and Networks Workshops .","author":"Lin Fan","year":"2018"},{"key":"e_1_2_1_19_1","volume-title":"Spark-Based Rare Association Rule Mining for Big Datasets. In IEEE International Conference on Big Data (Big Data) .","author":"Liu Ruilin","year":"2016"},{"key":"e_1_2_1_20_1","unstructured":"MySQL. 2019. MySQL Customer: Facebook. https:\/\/www.mysql.com\/customers\/view\/?id=757"},{"key":"e_1_2_1_21_1","volume-title":"Root Cause Analysis with Enriched Process Logs. In International Conference on Business Process Management","volume":"132","author":"Suriadi Suriadi"},{"key":"e_1_2_1_22_1","volume-title":"Hive - A Petabyte Scale Data Warehouse Using Hadoop. In IEEE International Conference on Data Engineering (ICDE) .","author":"Thusoo Ashish","year":"2010"},{"key":"e_1_2_1_23_1","volume-title":"Presto: Interacting with Petabytes of Data at Facebook. https:\/\/www.facebook.com\/notes\/facebook-engineering\/presto-interacting-with-petabytes-of-data-at-facebook\/10151786197628920\/","author":"Traverso Martin","year":"2013"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2523616.2523633"},{"key":"e_1_2_1_25_1","volume":"201","author":"Verma A.","journal-title":"J. Wilkes."},{"key":"e_1_2_1_26_1","doi-asserted-by":"crossref","unstructured":"Bowei Wang Dan Chen Benyun Shi Jindong Zhang Yifu Duan Jingying Chen and Ruimin Hu. 2017. Comprehensive Association Rules Mining of Health Examination Data with an Extended FP-Growth Method. 
In Mobile Networks and Applications .","DOI":"10.1007\/s11036-016-0793-6"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2007.86"},{"key":"e_1_2_1_28_1","volume-title":"Pal","author":"Witten Ian H.","year":"2017"},{"key":"e_1_2_1_29_1","unstructured":"Tzu-Tsung Wong and Kuo-Lung Tseng. 2005. Mining Negative Contrast Sets from Data with Discrete Attributes. In Expert Systems with Applications ."},{"key":"e_1_2_1_30_1","unstructured":"Kenny Yu and Chunqiang (CQ) Tang. 2019. Efficient Reliable Cluster Management at Scale with Tupperware. https:\/\/engineering.fb.com\/data-center-engineering\/tupperware\/"},{"key":"e_1_2_1_31_1","volume-title":"Network Alarm Flood Pattern Mining Algorithm Based on Multi-dimensional Association. In ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWIM) .","author":"Zhang Xudong","year":"2018"},{"key":"e_1_2_1_32_1","unstructured":"Xiang Zhang Junbo Zhao and Yann LeCun. 2015. Character-Level Convolutional Networks for Text Classification. In Advances in neural information processing systems. 
649--657."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/2733004.2733012"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2785956.2787473"}],"container-title":["Proceedings of the ACM on Measurement and Analysis of Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3392149","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3392149","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:38:49Z","timestamp":1750199929000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3392149"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,6,9]]},"references-count":34,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2020,6,9]]}},"alternative-id":["10.1145\/3392149"],"URL":"https:\/\/doi.org\/10.1145\/3392149","relation":{},"ISSN":["2476-1249"],"issn-type":[{"value":"2476-1249","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,6,9]]},"assertion":[{"value":"2020-06-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}