{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T21:00:56Z","timestamp":1770238856025,"version":"3.49.0"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"FSE","license":[{"start":{"date-parts":[[2024,7,12]],"date-time":"2024-07-12T00:00:00Z","timestamp":1720742400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"MUR, Ministero Universit\u00e0 e Ricerca","award":["2022EYX28N PRIN 2022"],"award-info":[{"award-number":["2022EYX28N PRIN 2022"]}]},{"name":"SNF Swiss National Foundation","award":["200021_178742"],"award-info":[{"award-number":["200021_178742"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Softw. Eng."],"published-print":{"date-parts":[[2024,7,12]]},"abstract":"<jats:p>Predicting failures in production environments allows service providers to activate countermeasures that prevent harming the users of the applications. The most successful approaches predict failures from error states that the current approaches identify from anomalies in time series of fixed sets of KPI values collected at runtime. They cannot handle time series of KPI sets with size that varies over time. Thus these approaches work with applications that run on statically configured sets of components and computational nodes, and do not scale up to the many popular cloud applications that exploit autoscaling.<\/jats:p>\n                  <jats:p>\n                    This paper proposes P\n                    <jats:sc>reface<\/jats:sc>\n                    , a novel approach to predict failures in cloud applications that exploit autoscaling. P\n                    <jats:sc>reface<\/jats:sc>\n                    originally augments the neural-network-based failure predictors successfully exploited to predict failures in statically configured applications, with a R\n                    <jats:sc>ectifier<\/jats:sc>\n                    layer that handles KPI sets of highly variable size as the ones collected in cloud autoscaling applications, and reduces those KPIs to a set of\n                    <jats:italic toggle=\"yes\">rectified-KPIs<\/jats:italic>\n                    of fixed size that can be fed to the neural-network predictor. The P\n                    <jats:sc>reface<\/jats:sc>\n                    R\n                    <jats:sc>ectifier<\/jats:sc>\n                    computes the\n                    <jats:italic toggle=\"yes\">rectified-KPIs<\/jats:italic>\n                    as descriptive statistics of the original KPIs, for each logical component of the target application. The descriptive statistics shrink the highly variable sets of KPIs collected at different timestamps to a fixed set of values compatible with the input nodes of the neural-network failure predictor. The neural network can then reveal anomalies that correspond to error states, before they propagate to failures that harm the users of the applications. The experiments on both a commercial application and a widely used academic exemplar confirm that P\n                    <jats:sc>reface<\/jats:sc>\n                    can indeed predict many harmful failures early enough to activate proper countermeasures.\n                  <\/jats:p>","DOI":"10.1145\/3660794","type":"journal-article","created":{"date-parts":[[2024,7,12]],"date-time":"2024-07-12T10:22:09Z","timestamp":1720779729000},"page":"1960-1981","source":"Crossref","is-referenced-by-count":2,"title":["Predicting Failures of Autoscaling Distributed Applications"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7566-8051","authenticated-orcid":false,"given":"Giovanni","family":"Denaro","sequence":"first","affiliation":[{"name":"University of Milano-Bicocca, Milan, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4871-3734","authenticated-orcid":false,"given":"Noura","family":"El Moussa","sequence":"additional","affiliation":[{"name":"Universit\u00e0 della Svizzera Italiana (USI), Lugano, Switzerland"},{"name":"Constructor Institute Schaffhausen, Schaffhausen, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-7428-2429","authenticated-orcid":false,"given":"Rahim","family":"Heydarov","sequence":"additional","affiliation":[{"name":"Universit\u00e0 della Svizzera Italiana (USI), Lugano, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3033-3044","authenticated-orcid":false,"given":"Francesco","family":"Lomio","sequence":"additional","affiliation":[{"name":"Constructor Institute Schaffhausen, Schaffhausen, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5193-7379","authenticated-orcid":false,"given":"Mauro","family":"Pezz\u00e8","sequence":"additional","affiliation":[{"name":"Universit\u00e0 della Svizzera Italiana (USI), Lugano, Switzerland"},{"name":"Constructor Institute Schaffhausen, Schaffhausen, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-9750-2762","authenticated-orcid":false,"given":"Ketai","family":"Qiu","sequence":"additional","affiliation":[{"name":"Universit\u00e0 della Svizzera Italiana (USI), Lugano, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,7,12]]},"reference":[{"key":"e_1_3_1_2_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2017.04.070"},{"key":"e_1_3_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICEngTechnol.2017.8308186"},{"key":"e_1_3_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2004.2"},{"key":"e_1_3_1_5_1","doi-asserted-by":"crossref","unstructured":"Peter Bodik Moises Goldszmidt Armando Fox Dawn B Woodard and Hans Andersen. 2010. Fingerprinting the datacenter: automated classification of performance crises. In Proceedings of the 5th European conference on Computer systems. 111\u2013124.","DOI":"10.1145\/1755913.1755926"},{"key":"e_1_3_1_6_1","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1007\/978-3-319-48057-2_9","volume-title":"International Conference on Future Data and Security engineering","author":"Bontemps Lo\u00efc","year":"2016","unstructured":"Lo\u00efc Bontemps, James McDermott, Nhien-An Le-Khac, et al. 2016. Collective anomaly detection based on long shortterm memory recurrent neural networks. In International Conference on Future Data and Security engineering. Springer, 141\u2013152."},{"key":"e_1_3_1_7_1","first-page":"1","volume-title":"2008 IEEE International Symposium on Parallel and Distributed Processing","author":"Chung I-Hsin","year":"2008","unstructured":"I-Hsin Chung, Guojing Cong, David Klepacki, Simone Sbaraglia, Seetharami Seelam, and Hui-Fang Wen. 2008. A framework for automated performance bottleneck detection. In 2008 IEEE International Symposium on Parallel and Distributed Processing. IEEE, 1\u20137."},{"key":"e_1_3_1_8_1","doi-asserted-by":"crossref","first-page":"544","DOI":"10.1109\/CLOUD.2017.75","volume-title":"2017 IEEE 10th International Conference on Cloud Computing (CLOUD)","author":"Davis Nickolas Allen","year":"2017","unstructured":"Nickolas Allen Davis, Abdelmounaam Rezgui, Hamdy Soliman, Skyler Manzanares, and Milagre Coates. 2017. Failuresim: A system for predicting hardware failures in cloud data centers using neural networks. In 2017 IEEE 10th International Conference on Cloud Computing (CLOUD). IEEE, 544\u2013551."},{"key":"e_1_3_1_9_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2208.11939"},{"key":"e_1_3_1_10_1","doi-asserted-by":"publisher","unstructured":"Giovanni Denaro Noura El Moussa Rahim Heydarov Francesco Lomio Mauro Pezz\u00e8 and Ketai Qiu. 2024. Preface Replication Package. https:\/\/doi.org\/10.5281\/zenodo.11160861 10.5281\/zenodo.11160861. Online; accessed 29 May 2024.","DOI":"10.5281\/zenodo.11160861"},{"key":"e_1_3_1_11_1","doi-asserted-by":"crossref","unstructured":"Min Du Feifei Li Guineng Zheng and Vivek Srikumar. 2017. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 1285\u20131298.","DOI":"10.1145\/3133956.3134015"},{"issue":"2","key":"e_1_3_1_12_1","doi-asserted-by":"crossref","first-page":"97","DOI":"10.3233\/MGS-170263","article-title":"A threshold sensitive failure prediction method using support vector machine","volume":"13","author":"Tehrani Ahmad Fadaei","year":"2017","unstructured":"Ahmad Fadaei Tehrani and Faramarz Safi-Esfahani. 2017. A threshold sensitive failure prediction method using support vector machine. Multiagent and Grid Systems 13, 2 (2017), 97\u2013111.","journal-title":"Multiagent and Grid Systems"},{"key":"e_1_3_1_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jnca.2015.11.024"},{"key":"e_1_3_1_14_1","first-page":"5","volume-title":"Proceedings of the USENIX conference on Analysis of system logs (WASL\u201908)","author":"Fulp Errin W.","year":"2008","unstructured":"Errin W. Fulp, Glenn A Fink, and Jereme N Haack. 2008. Predicting Computer System Failures Using Support Vector Machines.. In Proceedings of the USENIX conference on Analysis of system logs (WASL\u201908). USENIX Association, 5\u20135."},{"key":"e_1_3_1_15_1","article-title":"Task failure prediction in cloud data centers using deep learning","author":"Gao Jiechao","year":"2020","unstructured":"Jiechao Gao, Haoyu Wang, and Haiying Shen. 2020. Task failure prediction in cloud data centers using deep learning. IEEE Transactions on Services Computing (2020).","journal-title":"IEEE Transactions on Services Computing"},{"key":"e_1_3_1_16_1","doi-asserted-by":"crossref","unstructured":"Luca Gazzola Leonardo Mariani Fabrizio Pastore and Mauro Pezz\u00e8. 2017. An Exploratory Study of Field Failures. In Proceedings of the International Symposium on Software Reliability Engineering (ISSRE \u201917).","DOI":"10.1109\/ISSRE.2017.10"},{"key":"e_1_3_1_17_1","doi-asserted-by":"publisher","DOI":"10.5555\/3086952"},{"issue":"1","key":"e_1_3_1_18_1","first-page":"52","article-title":"Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems","volume":"7","author":"Guan Qiang","year":"2012","unstructured":"Qiang Guan, Ziming Zhang, and Song Fu. 2012. Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems. Journal of Communication 7, 1 (2012), 52\u201361.","journal-title":"Journal of Communication"},{"key":"e_1_3_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3342280.3342335"},{"issue":"1","key":"e_1_3_1_20_1","first-page":"4:1","article-title":"Performance Anomaly Detection and Bottleneck Identification","volume":"48","author":"Ibidunmoye Olumuyiwa","year":"2015","unstructured":"Olumuyiwa Ibidunmoye, Francisco Hern\u00e1ndez-Rodriguez, and Erik Elmroth. 2015. Performance Anomaly Detection and Bottleneck Identification. Comput. Surveys 48, 1 (2015), 4:1\u20134:35.","journal-title":"Comput. Surveys"},{"key":"e_1_3_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNSM.2017.2750906"},{"key":"e_1_3_1_22_1","doi-asserted-by":"crossref","unstructured":"Tariqul Islam and Dakshnamoorthy Manivannan. 2017. Predicting application failure in cloud: A machine learning approach. In 2017 IEEE International Conference on Cognitive Computing (ICCC). IEEE 24\u201331.","DOI":"10.1109\/IEEE.ICCC.2017.11"},{"key":"e_1_3_1_23_1","doi-asserted-by":"publisher","DOI":"10.1080\/08839514.2019.1637138"},{"key":"e_1_3_1_24_1","unstructured":"KubernetesDocs2022 2022. Kubernetes Documentation. https:\/\/kubernetes.io\/docs. [Online; accessed Aug-2022]."},{"key":"e_1_3_1_25_1","article-title":"A critical review of recurrent neural networks for sequence learning","author":"Lipton Zachary C","year":"2015","unstructured":"Zachary C Lipton, John Berkowitz, and Charles Elkan. 2015. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019 (2015).","journal-title":"arXiv preprint arXiv:1506.00019"},{"key":"e_1_3_1_26_1","doi-asserted-by":"crossref","unstructured":"Joao Paulo Magalhaes and Luis Moura Silva. 2011. Root-cause analysis of performance anomalies in web-based applications. In Proceedings of the 2011 ACM Symposium on Applied Computing. 209\u2013216.","DOI":"10.1145\/1982185.1982234"},{"key":"e_1_3_1_27_1","first-page":"1012","volume-title":"Proceedings of the International Conference on Software Engineering (ICSE \u201913)","author":"Malik H.","year":"2013","unstructured":"H. Malik, H. Hemmati, and A. E. Hassan. 2013. Automatic detection of performance deviations in the load testing of Large Scale Systems. In Proceedings of the International Conference on Software Engineering (ICSE \u201913). IEEE Computer Society, 1012\u20131021."},{"key":"e_1_3_1_28_1","doi-asserted-by":"publisher","DOI":"10.1214\/aoms\/1177730491"},{"key":"e_1_3_1_29_1","first-page":"262","volume-title":"Proceedings of the International Conference on Software Testing, Verification and Validation (ICST \u201918)","author":"Mariani Leonardo","year":"2018","unstructured":"Leonardo Mariani, Cristina Monni, Mauro Pezz\u00e8, Oliviero Riganelli, and Rui Xin. 2018. Localizing Faults in Cloud Systems. In Proceedings of the International Conference on Software Testing, Verification and Validation (ICST \u201918). IEEE Computer Society, 262\u2013273."},{"key":"e_1_3_1_30_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jss.2019.110464"},{"key":"e_1_3_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/2600239.2600241"},{"key":"e_1_3_1_32_1","doi-asserted-by":"crossref","unstructured":"Gr\u00e9goire Mesnil Xiaodong He Li Deng and Yoshua Bengio. 2013. Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding.. In Interspeech. 3771\u20133775.","DOI":"10.21437\/Interspeech.2013-596"},{"key":"e_1_3_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICST.2019.00024"},{"key":"e_1_3_1_34_1","volume-title":"Introduction to descriptive statistics","author":"Nicholas Jackie","year":"1990","unstructured":"Jackie Nicholas. 1990. Introduction to descriptive statistics. Mathematics Learning Centre, University of Sydney."},{"key":"e_1_3_1_35_1","doi-asserted-by":"crossref","first-page":"282","DOI":"10.1145\/2610384.2610410","volume-title":"Proceedings of the International Symposium on Software Testing and Analysis (ISSTA \u201914)","author":"Nistor Adrian","year":"2014","unstructured":"Adrian Nistor and Lenin Ravindranath. 2014. SunCat: Helping Developers Understand and Predict Performance Problems in Smartphone Applications. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA \u201914). ACM, 282\u2013292."},{"key":"e_1_3_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2015.2442577"},{"key":"e_1_3_1_37_1","first-page":"1","article-title":"Diagnosing Performance Changes by Comparing Request Flows.","volume":"5","author":"Sambasivan Raja R","year":"2011","unstructured":"Raja R Sambasivan, Alice X Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, and Gregory R Ganger. 2011. Diagnosing Performance Changes by Comparing Request Flows.. In NSDI, Vol. 5. 1\u20131.","journal-title":"NSDI"},{"key":"e_1_3_1_38_1","first-page":"196","volume-title":"Proceedings of the International Symposium on Software Reliability Engineering (ISSRE \u201916)","author":"Sauvanaud C.","year":"2016","unstructured":"C. Sauvanaud, K. Lazri, M. Ka\u00e2niche, and K. Kanoun. 2016. Anomaly Detection and Root Cause Localization in Virtual Network Functions. In Proceedings of the International Symposium on Software Reliability Engineering (ISSRE \u201916). IEEE Computer Society, 196\u2013206."},{"key":"e_1_3_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3316781.3317918"},{"key":"e_1_3_1_40_1","first-page":"173","volume-title":"Proceedings of the Symposium on Principles of Distributed Computing (PODC \u201912)","author":"Tan Yongmin","year":"2010","unstructured":"Yongmin Tan, Xiaohui Gu, and Haixun Wang. 2010. Adaptive system anomaly prediction for large-scale hosting infrastructures. In Proceedings of the Symposium on Principles of Distributed Computing (PODC \u201912). ACM, 173\u2013182."},{"key":"e_1_3_1_41_1","first-page":"285","volume-title":"2012 IEEE 32nd International Conference on Distributed Computing Systems","author":"Tan Yongmin","year":"2012","unstructured":"Yongmin Tan, Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Chitra Venkatramani, and Deepak Rajan. 2012. Prepare: Predictive performance anomaly prevention for virtualized cloud systems. In 2012 IEEE 32nd International Conference on Distributed Computing Systems. IEEE, 285\u2013294."}],"container-title":["Proceedings of the ACM on Software Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3660794","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3660794","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T07:54:08Z","timestamp":1770191648000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3660794"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,12]]},"references-count":40,"journal-issue":{"issue":"FSE","published-print":{"date-parts":[[2024,7,12]]}},"alternative-id":["10.1145\/3660794"],"URL":"https:\/\/doi.org\/10.1145\/3660794","relation":{},"ISSN":["2994-970X"],"issn-type":[{"value":"2994-970X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,12]]}}}