{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,31]],"date-time":"2026-01-31T08:24:09Z","timestamp":1769847849956,"version":"3.49.0"},"reference-count":32,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,7,23]],"date-time":"2024-07-23T00:00:00Z","timestamp":1721692800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,7,23]],"date-time":"2024-07-23T00:00:00Z","timestamp":1721692800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Within the Hadoop ecosystem, MapReduce stands as a cornerstone for managing, processing, and mining large-scale datasets. Yet, the absence of efficient solutions for precise estimation of job execution times poses a persistent challenge, impacting task allocation and distribution within Hadoop clusters. In this study, we present a comprehensive machine learning approach for predicting the execution time of MapReduce jobs, encompassing data collection, preprocessing, feature engineering, and model evaluation. Leveraging a rich dataset derived from comprehensive Hadoop MapReduce job traces, we explore the intricate relationship between cluster parameters and job performance. Through a comparative analysis of machine learning models, including linear regression, decision tree, random forest, and gradient-boosted regression trees, we identify the random forest model as the most effective, demonstrating superior predictive accuracy and robustness. Our findings underscore the critical role of features such as data size and resource allocation in determining job performance. 
With this work, we aim to enhance resource management efficiency and enable more effective utilisation of cloud-based Hadoop clusters for large-scale data processing tasks.<\/jats:p>","DOI":"10.1186\/s40537-024-00964-z","type":"journal-article","created":{"date-parts":[[2024,7,23]],"date-time":"2024-07-23T10:12:46Z","timestamp":1721729566000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Predictive modelling of MapReduce job performance in cloud environments using machine learning techniques"],"prefix":"10.1186","volume":"11","author":[{"given":"Mohammed","family":"Bergui","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Soufiane","family":"Hourri","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Said","family":"Najah","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nikola S.","family":"Nikolov","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2024,7,23]]},"reference":[{"key":"964_CR1","doi-asserted-by":"publisher","first-page":"330","DOI":"10.14778\/1920841.1920886","volume":"3","author":"S Melnik","year":"2010","unstructured":"Melnik S, Gubarev A, Long JJ, Romer G, Shivakumar S, Tolton M, et al. Dremel: interactive analysis of web-scale datasets. Proc VLDB Endow. 2010;3:330\u20139. https:\/\/doi.org\/10.14778\/1920841.1920886.","journal-title":"Proc VLDB Endow"},{"issue":"2","key":"964_CR2","doi-asserted-by":"publisher","first-page":"1626","DOI":"10.14778\/1687553.1687609","volume":"2","author":"A Thusoo","year":"2009","unstructured":"Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, et al. Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endow. 2009;2(2):1626. 
https:\/\/doi.org\/10.14778\/1687553.1687609.","journal-title":"Proc VLDB Endow"},{"key":"964_CR3","unstructured":": Apache Hadoop. http:\/\/hadoop.apache.org\/. Accessed 11 June 2024."},{"issue":"1","key":"964_CR4","doi-asserted-by":"publisher","first-page":"107","DOI":"10.1145\/1327452.1327492","volume":"51","author":"J Dean","year":"2008","unstructured":"Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun ACM. 2008;51(1):107\u201311. https:\/\/doi.org\/10.1145\/1327452.1327492.","journal-title":"Commun ACM"},{"key":"964_CR5","unstructured":": Apache Hbase. http:\/\/hbase.apache.org\/. Accessed 11 June 2024."},{"key":"964_CR6","unstructured":": Apache Storm. http:\/\/storm.apache.org\/. Accessed 11 June 2024."},{"key":"964_CR7","unstructured":": Apache Giraph. http:\/\/giraph.apache.org\/. Accessed 11 June 2024."},{"key":"964_CR8","unstructured":": Apache Oozie. http:\/\/oozie.apache.org\/. Accessed 11 June 2024."},{"key":"964_CR9","unstructured":": Apache Mahout. http:\/\/mahout.apache.org\/. Accessed 11 June 2024."},{"key":"964_CR10","doi-asserted-by":"crossref","unstructured":"Chen CO, Zhuo YQ, Yeh CC, Lin CM, Liao SW. Machine learning-based configuration parameter tuning on Hadoop system. In: 2015 IEEE International Congress on Big Data. 2015; 386\u2013392.","DOI":"10.1109\/BigDataCongress.2015.64"},{"key":"964_CR11","doi-asserted-by":"crossref","unstructured":"Kadirvel S, Fortes JAB. Grey-Box approach for performance prediction in map-reduce based platforms. In: 2012 21st International Conference on Computer Communications and Networks (ICCCN). 2012; 1\u20139.","DOI":"10.1109\/ICCCN.2012.6289311"},{"key":"964_CR12","doi-asserted-by":"crossref","unstructured":"Lama P, Zhou X. AROMA: automated resource allocation and configuration of mapreduce environment in the cloud. In: Proceedings of the 9th International Conference on Autonomic Computing. ICAC \u201912. New York, NY, USA: Association for Computing Machinery. 2012; 63-72. 
https:\/\/doi.org\/10.1145\/2371536.2371547.","DOI":"10.1145\/2371536.2371547"},{"key":"964_CR13","doi-asserted-by":"crossref","unstructured":"Yang H, Luan Z, Li W, Qian D, Guan G. Statistics-based Workload Modeling for MapReduce. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. 2012; 2043\u20132051.","DOI":"10.1109\/IPDPSW.2012.254"},{"key":"964_CR14","doi-asserted-by":"crossref","unstructured":"Verma A, Cherkasova L, Campbell RH. ARIA: automatic resource inference and allocation for mapreduce environments. In: Proceedings of the 8th ACM International Conference on Autonomic Computing. ICAC \u201911. New York, NY, USA: Association for Computing Machinery. 2011; 235-244. https:\/\/doi.org\/10.1145\/1998582.1998637.","DOI":"10.1145\/1998582.1998637"},{"key":"964_CR15","unstructured":"Zhang Z, Cherkasova L, Loo BT. AutoTune: optimizing execution concurrency and resource usage in mapreduce workflows. In: 10th International Conference on Autonomic Computing (ICAC 13). San Jose, CA: USENIX Association. 2013; 175\u2013181. https:\/\/www.usenix.org\/conference\/icac13\/technical-sessions\/presentation\/zhang_zhuoyao."},{"key":"964_CR16","doi-asserted-by":"crossref","unstructured":"Gandhi A, Thota S, Dube P, Kochut A, Zhang L. Autoscaling for Hadoop Clusters. In: 2016 IEEE International Conference on Cloud Engineering (IC2E). 2016; 109\u2013118.","DOI":"10.1109\/IC2E.2016.11"},{"issue":"2","key":"964_CR17","doi-asserted-by":"publisher","first-page":"441","DOI":"10.1109\/TPDS.2015.2405552","volume":"27","author":"M Khan","year":"2016","unstructured":"Khan M, Jin Y, Li M, Xiang Y, Jiang C. Hadoop performance modeling for job estimation and resource provisioning. IEEE Trans Parallel Distrib Syst. 2016;27(2):441\u20135. 
https:\/\/doi.org\/10.1109\/TPDS.2015.2405552.","journal-title":"IEEE Trans Parallel Distrib Syst"},{"key":"964_CR18","doi-asserted-by":"crossref","unstructured":"Song G, Meng Z, Huet F, Magoules F, Yu L, Lin X. A Hadoop MapReduce performance prediction method. In: 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing. 2013; 820\u2013825.","DOI":"10.1109\/HPCC.and.EUC.2013.118"},{"key":"964_CR19","doi-asserted-by":"crossref","unstructured":"Tariq H, Al-Sahaf H, Welch I. Modelling and prediction of resource utilization of Hadoop clusters: a machine learning approach. In: Proceedings of the 12th IEEE\/ACM International Conference on Utility and Cloud Computing. UCC\u201919. New York, NY, USA: Association for Computing Machinery. 2019; 93-100. https:\/\/doi.org\/10.1145\/3344341.3368821.","DOI":"10.1145\/3344341.3368821"},{"key":"964_CR20","doi-asserted-by":"crossref","unstructured":"Zhang Z, Cherkasova L, Loo BT. Benchmarking approach for designing a Mapreduce performance model. In: Proceedings of the 4th ACM\/SPEC International Conference on Performance Engineering. ICPE \u201913. New York, NY, USA: Association for Computing Machinery. 2013; 253-258. https:\/\/doi.org\/10.1145\/2479871.2479906.","DOI":"10.1145\/2479871.2479906"},{"key":"964_CR21","doi-asserted-by":"crossref","unstructured":"Sangroya A, Singhal R. Performance assurance model for HiveQL on large data volume. In: 2015 IEEE 22nd International Conference on High Performance Computing Workshops. 2015; 26\u201333.","DOI":"10.1109\/HiPCW.2015.8"},{"key":"964_CR22","doi-asserted-by":"crossref","unstructured":"Ceesay S, Barker A, Lin Y. Benchmarking and performance modelling of MapReduce communication pattern. In: 2019 IEEE International Conference on Cloud Computing Technology and Science (CloudCom). 
2019; 127\u2013134.","DOI":"10.1109\/CloudCom.2019.00029"},{"key":"964_CR23","unstructured":": Dataproc | Google Cloud. https:\/\/cloud.google.com\/dataproc. Accessed 11 June 2024."},{"key":"964_CR24","doi-asserted-by":"publisher","first-page":"341","DOI":"10.1007\/978-3-031-28073-3_24","volume-title":"Advances in information and communication","author":"M Bergui","year":"2023","unstructured":"Bergui M, Nikolov NS, Najah S. Hadoop dataset for job estimation in the cloud with limited bandwidth. In: Arai K, editor. Advances in information and communication. Springer Nature Switzerland: Cham; 2023. p. 341\u20138."},{"key":"964_CR25","volume-title":"Hadoop: the definitive guide","author":"T White","year":"2015","unstructured":"White T. Hadoop: the definitive guide. 4th ed. Sebastopol: O\u2019Reilly Media, Inc.; 2015.","edition":"4"},{"key":"964_CR26","doi-asserted-by":"crossref","unstructured":"Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, et\u00a0al. Apache Hadoop YARN: Yet Another Resource Negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing. SOCC \u201913. ACM. 2013; 5:1\u20135:16.","DOI":"10.1145\/2523616.2523633"},{"key":"964_CR27","doi-asserted-by":"crossref","unstructured":"Babu S. Towards automatic optimization of MapReduce programs. In: Proceedings of the 1st ACM Symposium on Cloud Computing. SoCC \u201910. New York, NY, USA: Association for Computing Machinery. 2010; 137-142. https:\/\/doi.org\/10.1145\/1807128.1807150.","DOI":"10.1145\/1807128.1807150"},{"key":"964_CR28","doi-asserted-by":"publisher","first-page":"7177","DOI":"10.1007\/s11227-020-03162-9","volume":"76","author":"A Gandomi","year":"2020","unstructured":"Gandomi A, Movaghar A, Reshadi M, Khademzadeh A. Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach. J Supercomput. 2020;76:7177\u2013720. 
https:\/\/doi.org\/10.1007\/s11227-020-03162-9.","journal-title":"J Supercomput"},{"key":"964_CR29","unstructured":": TPCx-BB Express Big Data Benchmark. https:\/\/www.tpc.org\/tpcx-bb\/. Accessed 11 June 2024."},{"key":"964_CR30","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-020-00319-4","volume":"7","author":"N Peyravi","year":"2020","unstructured":"Peyravi N, Moeini A. Estimating runtime of a job in Hadoop MapReduce. J Big Data. 2020;7:1. https:\/\/doi.org\/10.1186\/s40537-020-00319-4.","journal-title":"J Big Data"},{"key":"964_CR31","doi-asserted-by":"publisher","unstructured":"Shi L, Wang Z, Yu W, Meng X. A case study of tuning MapReduce for efficient Bioinformatics in the cloud. Parallel Computing. 2017; 61: 83\u201395. Special Issue on 2015 Workshop on Data Intensive Scalable Computing Systems (DISCS-2015). https:\/\/doi.org\/10.1016\/j.parco.2016.10.002.","DOI":"10.1016\/j.parco.2016.10.002"},{"key":"964_CR32","unstructured":": MapReduce Tutorial - Official Documentation. https:\/\/hadoop.apache.org\/docs\/stable\/hadoop-mapreduce-client\/hadoop-mapreduce-client-core\/MapReduceTutorial.html. 
Accessed 25 June 2024."}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-024-00964-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-024-00964-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-024-00964-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,23]],"date-time":"2024-07-23T10:20:32Z","timestamp":1721730032000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-024-00964-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,23]]},"references-count":32,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["964"],"URL":"https:\/\/doi.org\/10.1186\/s40537-024-00964-z","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,23]]},"assertion":[{"value":"25 March 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 July 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 July 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"All authors have consented to the submission and publication of 
this manuscript.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests relevant to the publication of this manuscript.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"98"}}