{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,4,26]],"date-time":"2024-04-26T22:25:31Z","timestamp":1714170331280},"reference-count":45,"publisher":"Oxford University Press (OUP)","issue":"12","license":[{"start":{"date-parts":[[2021,9,20]],"date-time":"2021-09-20T00:00:00Z","timestamp":1632096000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"Developing Data-Intensive Cloud Applications with Iterative Quality Enhancements","award":["644869"],"award-info":[{"award-number":["644869"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,12,30]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Nowadays, Apache Hadoop and Apache Spark are two of the most prominent distributed solutions for processing big data applications on the market. Since in many cases these frameworks are adopted to support business critical activities, it is often important to predict with fair confidence the execution time of submitted applications, for instance when service-level agreements are established with end-users. In this work, we propose and validate a hybrid approach for the performance prediction of big data applications running on clouds, which exploits both analytical modeling and machine learning (ML) techniques and it is able to achieve a good accuracy without too many time consuming and costly experiments on a real setup. The experimental results show how the proposed approach attains improvement in accuracy, number of experiments to be run on the operational system and cost over applying ML techniques without any support from analytical models. Moreover, we compare our approach with Ernest, an ML-based technique proposed in the literature by the Spark inventors. Experiments show that Ernest can accurately estimate the performance in interpolating scenarios while it fails to predict the performance when configurations with increasing number of cores are considered. Finally, a comparison with a similar hybrid approach proposed in the literature demonstrates how our approach significantly reduce prediction errors especially when few experiments on the real system are performed.<\/jats:p>","DOI":"10.1093\/comjnl\/bxab131","type":"journal-article","created":{"date-parts":[[2021,9,8]],"date-time":"2021-09-08T11:19:10Z","timestamp":1631099950000},"page":"3123-3140","source":"Crossref","is-referenced-by-count":3,"title":["A Hybrid Machine Learning Approach for Performance Modeling of Cloud-Based Big Data Applications"],"prefix":"10.1093","volume":"65","author":[{"given":"Ehsan","family":"Ataie","sequence":"first","affiliation":[{"name":"Department of Computer Engineering , University of Mazandaran, Babolsar, Iran"},{"name":"Distributed Computing Systems Research Group , University of Mazandaran, Babolsar, Iran"}]},{"given":"Athanasia","family":"Evangelinou","sequence":"additional","affiliation":[{"name":"Dipartimento di Elettronica , Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy"}]},{"given":"Eugenio","family":"Gianniti","sequence":"additional","affiliation":[{"name":"Dipartimento di Elettronica , Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy"}]},{"given":"Danilo","family":"Ardagna","sequence":"additional","affiliation":[{"name":"Dipartimento di Elettronica , Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy"}]}],"member":"286","published-online":{"date-parts":[[2021,9,20]]},"reference":[{"key":"2023010312515963200_ref1","first-page":"469","article-title":"CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics","volume-title":"Proc. 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201917)","author":"Alipourfard"},{"key":"2023010312515963200_ref2","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1093\/comjnl\/bxx079","article-title":"Integrating and querying OpenStreetMap and linked geo open data","volume":"62","author":"Almendros-Jim\u00e9nez","year":"2019","journal-title":"Comput. J."},{"key":"2023010312515963200_ref3","doi-asserted-by":"crossref","first-page":"192","DOI":"10.1145\/3184407.3184420","article-title":"Performance Prediction of Cloud-Based Big Data Applications","volume-title":"Proc. 2018 ACM\/SPEC Int. Conf. Performance Engineering (ICPE\u201918)","author":"Ardagna","year":"2018"},{"key":"2023010312515963200_ref4","doi-asserted-by":"crossref","first-page":"599","DOI":"10.1007\/978-3-319-49583-5_47","article-title":"Modeling Performance of Hadoop Applications: A Journey from Queueing Networks to Stochastic Well Formed Nets","volume-title":"Proc. Int. Conf. Algorithms and Architectures for Parallel Processing (ICA3PP\u201916)","author":"Ardagna","year":"2016"},{"key":"2023010312515963200_ref5","first-page":"1","article-title":"Rethinking the Use of Models in Software Architecture","volume-title":"Proc. 4th Int. Conf. Quality of Software Architectures (QoSA\u201908)","author":"Ardagna","year":"2008"},{"key":"2023010312515963200_ref6","doi-asserted-by":"crossref","first-page":"40","DOI":"10.1214\/09-SS054","article-title":"A survey of cross-validation procedures for model selection","volume":"4","author":"Arlot","year":"2010","journal-title":"Stat. Surv."},{"key":"2023010312515963200_ref7","first-page":"1","article-title":"A Combined Analytical Modeling Machine Learning Approach for Performance Prediction of MapReduce Jobs in Cloud Environment","volume-title":"Proc. 18th Int. Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC\u201916)","author":"Ataie","year":"2016"},{"key":"2023010312515963200_ref8","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1145\/1530873.1530877","article-title":"JMT: Performance engineering tools for system modeling","volume":"36","author":"Bertoli","year":"2009","journal-title":"ACM SIGMETRICS Perf. Eval. Rev."},{"key":"2023010312515963200_ref9","doi-asserted-by":"crossref","first-page":"15","DOI":"10.1016\/j.peva.2014.09.001","article-title":"Blending randomness in closed queueing network models","volume":"82","author":"Casale","year":"2014","journal-title":"Perf. Eval."},{"key":"2023010312515963200_ref10","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/1961189.1961199","article-title":"LIBSVM: A library for support vector machines","volume":"2","author":"Chang","year":"2011","journal-title":"ACM Trans. Intell. Syst. Technol."},{"key":"2023010312515963200_ref11","first-page":"479","article-title":"BOAT: Building Auto-Tuners with Structured Bayesian Optimization","volume-title":"Proc. 26th Int. Conf. World Wide Web (WWW\u201917)","author":"Dalibard","year":"2017"},{"key":"2023010312515963200_ref12","first-page":"127","article-title":"Quasar: Resource-Efficient and QoS-Aware Cluster Management","volume-title":"Proc. 19th Int. Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201914)","author":"Delimitrou","year":"2014"},{"key":"2023010312515963200_ref13","doi-asserted-by":"crossref","first-page":"939","DOI":"10.1007\/s00607-013-0376-3","article-title":"Identifying the optimal level of parallelism in transactional memory applications","volume":"97","author":"Didona","year":"2015","journal-title":"Computing"},{"key":"2023010312515963200_ref14","doi-asserted-by":"crossref","first-page":"145","DOI":"10.1145\/2668930.2688047","article-title":"Enhancing Performance Prediction Robustness by Combining Analytical Modeling and Machine Learning","volume-title":"Proc. 6th ACM\/SPEC Int. Conf. Performance Engineering (ICPE\u201915)","author":"Didona","year":"2015"},{"key":"2023010312515963200_ref15","article-title":"On bootstrapping machine learning performance predictors via analytical models","author":"Didona","year":"2014"},{"key":"2023010312515963200_ref16","doi-asserted-by":"crossref","first-page":"341","DOI":"10.1145\/2668930.2688823","article-title":"Hybrid Machine Learning\/Analytical Models for Performance Prediction: A Tutorial","volume-title":"Proc. 6th ACM\/SPEC Int. Conf. Performance Engineering (ICPE\u201915)","author":"Didona","year":"2015"},{"key":"2023010312515963200_ref17","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2620001","article-title":"Transactional auto scaler: Elastic scaling of replicated in-memory transactional data grids","volume":"9","author":"Didona","year":"2014","journal-title":"ACM Trans. Auton. Adapt. Syst."},{"key":"2023010312515963200_ref18","doi-asserted-by":"crossref","first-page":"1671","DOI":"10.1093\/comjnl\/bxz020","article-title":"Bigfeel\u2014a distributed processing environment for the integration of sentiment analysis methods","volume":"62","author":"Ferreira","year":"2019","journal-title":"Comput. J."},{"key":"2023010312515963200_ref19","first-page":"188","article-title":"Stage Aware Performance Modeling of DAG Based in Memory Analytic Platforms","volume-title":"Proc. 9th Int. Conf. Cloud Computing (CLOUD)","author":"Gibilisco","year":"2016"},{"key":"2023010312515963200_ref20","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2038916.2038934","article-title":"No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-Intensive Analytics","volume-title":"Proc. 2nd ACM Symposium on Cloud Computing (SOCC\u201911)","author":"Herodotou","year":"2011"},{"key":"2023010312515963200_ref21","volume-title":"Move big data to the public cloud with an insight PaaS","author":"Hopkins"},{"key":"2023010312515963200_ref22","first-page":"128","article-title":"Collective I\/O Tuning Using Analytical and Machine Learning Models","volume-title":"Proc. IEEE Int. Conf. Cluster Computing","author":"Isaila","year":"2015"},{"key":"2023010312515963200_ref23","first-page":"63","article-title":"AROMA: Automated Resource Allocation and Configuration of MapReduce Environment in the Cloud","volume-title":"Proc. 9th Int. Conf. Auton. Comput.","author":"Lama","year":"2012"},{"key":"2023010312515963200_ref24","volume-title":"Quantitative System Performance: Computer System Analysis Using Queueing Network Models","author":"Lazowska","year":"1984"},{"key":"2023010312515963200_ref25","doi-asserted-by":"crossref","first-page":"720","DOI":"10.1016\/j.peva.2013.08.013","article-title":"Joint optimization of overlapping phases in MapReduce","volume":"70","author":"Lin","year":"2013","journal-title":"Perf. Eval."},{"key":"2023010312515963200_ref26","doi-asserted-by":"crossref","first-page":"51","DOI":"10.1145\/2788402.2788410","article-title":"Optimal map reduce job capacity allocation in cloud systems","volume":"42","author":"Malekimajd","year":"2015","journal-title":"ACM SIGMETRICS Perf. Eval. Rev."},{"key":"2023010312515963200_ref27","article-title":"A survey of big data machine learning applications optimization in cloud data centers and networks","author":"Mohamed","year":"2019"},{"key":"2023010312515963200_ref28","first-page":"293","article-title":"Making Sense of Performance in Data Analytics Frameworks","volume-title":"Proc. 12th USENIX Conf. Networked Systems Design and Implementation (NSDI\u201915)","author":"Ousterhout","year":"2015"},{"key":"2023010312515963200_ref29","article-title":"Hemingway: Modeling distributed optimization algorithms","author":"Pan","year":"2017"},{"key":"2023010312515963200_ref30","doi-asserted-by":"crossref","first-page":"231","DOI":"10.1109\/TNSM.2012.122112.110163","article-title":"Deadline-based MapReduce workload management","volume":"10","author":"Polo","year":"2013","journal-title":"IEEE Trans. Netw. Service Manag."},{"key":"2023010312515963200_ref31","doi-asserted-by":"crossref","first-page":"295","DOI":"10.1093\/comjnl\/bxz125","article-title":"Virtual infrastructure orchestration for cloud service deployment","volume":"63","author":"Qadeer","year":"2020","journal-title":"Comput. J."},{"key":"2023010312515963200_ref32","article-title":"Support vector regression model for BigData systems","author":"Rizzi","year":"2016"},{"key":"2023010312515963200_ref33","first-page":"81","article-title":"Analytical\/ML Mixed Approach for Concurrency Regulation in Software Transactional Memory","volume-title":"Proc. 14th IEEE\/ACM Int. Symposium on Cluster, Cloud and Grid Computing (CCGrid\u201914)","author":"Rughetti","year":"2014"},{"key":"2023010312515963200_ref34","doi-asserted-by":"crossref","first-page":"1357","DOI":"10.1145\/2723372.2742790","article-title":"Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications","volume-title":"Proc. 2015 ACM SIGMOD Int. Conf. Management of Data","author":"Saha","year":"2015"},{"key":"2023010312515963200_ref35","doi-asserted-by":"crossref","first-page":"219","DOI":"10.1016\/j.simpat.2015.05.011","article-title":"A flexible framework for accurate simulation of cloud in-memory data stores","volume":"58","author":"Di Sanzo","year":"2015","journal-title":"Simul. Model. Pract. Theory"},{"key":"2023010312515963200_ref36","volume-title":"The digital universe","author":"Shirer","year":"2020"},{"key":"2023010312515963200_ref37","doi-asserted-by":"crossref","first-page":"287","DOI":"10.1007\/s10586-007-0035-6","article-title":"On the use of hybrid reinforcement learning for autonomic resource allocation","volume":"10","author":"Tesauro","year":"2007","journal-title":"Cluster Comput."},{"key":"2023010312515963200_ref38","doi-asserted-by":"crossref","first-page":"253","DOI":"10.1145\/1384529.1375486","article-title":"IRONModel: Robust performance models in the wild","volume":"36","author":"Thereska","year":"2008","journal-title":"ACM SIGMETRICS Perf. Eval. Rev."},{"key":"2023010312515963200_ref39","first-page":"363","article-title":"Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics","volume-title":"Proc. 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201916)","author":"Venkataraman","year":"2016"},{"key":"2023010312515963200_ref40","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1145\/1998582.1998637","article-title":"ARIA: Automatic Resource Inference and Allocation for MapReduce Environments","volume-title":"Proc. 8th ACM International Conference on Autonomic Computing (ICAC\u201911)","author":"Verma","year":"2011"},{"key":"2023010312515963200_ref41","doi-asserted-by":"crossref","first-page":"40534","DOI":"10.1109\/ACCESS.2019.2907018","article-title":"An energy efficiency optimization and control model for hadoop clusters","volume":"7","author":"Wang","year":"2019","journal-title":"IEEE Access"},{"key":"2023010312515963200_ref42","volume-title":"A decade later, apache spark still going strong","author":"Woodie"},{"key":"2023010312515963200_ref43","first-page":"11","article-title":"Towards Machine Learning-Based Auto-Tuning of MapReduce","volume-title":"Proc. 21st Int. Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems","author":"Yigitbasi","year":"2013"},{"key":"2023010312515963200_ref44","first-page":"10","article-title":"Spark: Cluster Computing with Working Sets","volume-title":"Proc. 2nd USENIX Conf. Hot Topics in Cloud Computing (HotCloud\u201910)","author":"Zaharia","year":"2010"},{"key":"2023010312515963200_ref45","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1145\/2934664","article-title":"Apache spark: A unified engine for big data processing","volume":"59","author":"Zaharia","year":"2016","journal-title":"Commun. ACM"}],"container-title":["The Computer Journal"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/comjnl\/article-pdf\/65\/12\/3123\/48480719\/bxab131.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/comjnl\/article-pdf\/65\/12\/3123\/48480719\/bxab131.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,3]],"date-time":"2023-01-03T12:53:10Z","timestamp":1672750390000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/comjnl\/article\/65\/12\/3123\/6372951"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,20]]},"references-count":45,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2021,9,20]]},"published-print":{"date-parts":[[2022,12,30]]}},"URL":"https:\/\/doi.org\/10.1093\/comjnl\/bxab131","relation":{},"ISSN":["0010-4620","1460-2067"],"issn-type":[{"value":"0010-4620","type":"print"},{"value":"1460-2067","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,12]]},"published":{"date-parts":[[2021,9,20]]}}}