{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,27]],"date-time":"2026-05-27T21:02:20Z","timestamp":1779915740558,"version":"3.53.1"},"reference-count":38,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2020,12,1]],"date-time":"2020-12-01T00:00:00Z","timestamp":1606780800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,12,14]],"date-time":"2020-12-14T00:00:00Z","timestamp":1607904000000},"content-version":"vor","delay-in-days":13,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"published-print":{"date-parts":[[2020,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Big Data analytics for storing, processing, and analyzing large-scale datasets has become an essential tool for the industry. The advent of distributed computing frameworks such as Hadoop and Spark offers efficient solutions to analyze vast amounts of data. Due to the application programming interface (API) availability and its performance, Spark becomes very popular, even more popular than the MapReduce framework. Both these frameworks have more than 150 parameters, and the combination of these parameters has a massive impact on cluster performance. The default system parameters help the system administrator deploy their system applications without much effort, and they can measure their specific cluster performance with factory-set parameters. However, an open question remains: can new parameter selection improve cluster performance for large datasets? 
In this regard, this study investigates the most impacting parameters, under resource utilization, input splits, and shuffle, to compare the performance between Hadoop and Spark, using an implemented cluster in our laboratory. We used a trial-and-error approach for tuning these parameters based on a large number of experiments. In order to evaluate the frameworks of comparative analysis, we select two workloads: WordCount and TeraSort. The performance metrics are carried out based on three criteria: execution time, throughput, and speedup. Our experimental results revealed that the performance of both systems depends heavily on input data size and correct parameter selection. The analysis of the results shows that Spark has better performance as compared to Hadoop when data sets are small, achieving up to two times speedup in WordCount workloads and up to 14 times in TeraSort workloads when default parameter values are reconfigured.<\/jats:p>","DOI":"10.1186\/s40537-020-00388-5","type":"journal-article","created":{"date-parts":[[2020,12,14]],"date-time":"2020-12-14T07:05:22Z","timestamp":1607929522000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":84,"title":["A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench"],"prefix":"10.1186","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5663-0042","authenticated-orcid":false,"given":"N.","family":"Ahmed","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7648-285X","authenticated-orcid":false,"given":"Andre L. 
C.","family":"Barczak","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9416-1435","authenticated-orcid":false,"given":"Teo","family":"Susnjak","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0844-5819","authenticated-orcid":false,"given":"Mohammed A.","family":"Rashid","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2020,12,14]]},"reference":[{"key":"388_CR1","unstructured":"Apache Hadoop Documentation 2014. http:\/\/hadoop.apache.org\/. Accessed 15 July 2020."},{"key":"388_CR2","doi-asserted-by":"crossref","unstructured":"Verma A, Mansuri AH, Jain N. Big data management processing with hadoop mapreduce and spark technology: A comparison. In: 2016 symposium on colossal data analysis and networking (CDAN). New York: IEEE; 2016. p. 1\u20134.","DOI":"10.1109\/CDAN.2016.7570891"},{"key":"388_CR3","doi-asserted-by":"publisher","DOI":"10.4018\/978-1-4666-9814-7","volume-title":"Big Data: concepts, methodologies, tools, and applications","author":"IR Management Association","year":"2016","unstructured":"Management Association IR. Big Data: concepts, methodologies, tools, and applications. Hershey: IGI Global; 2016."},{"key":"388_CR4","first-page":"45","volume":"37","author":"M Zaharia","year":"2012","unstructured":"Zaharia M, Chowdhury M, Das T, Dave A, Ma J, Mccauley M, Franklin M, Shenker S, Stoica I. Fast and interactive analytics over hadoop data with spark. Usenix Login. 2012;37:45\u201351.","journal-title":"Usenix Login"},{"issue":"1","key":"388_CR5","doi-asserted-by":"publisher","first-page":"107","DOI":"10.1145\/1327452.1327492","volume":"51","author":"J Dean","year":"2008","unstructured":"Dean J, Ghemawat S. Mapreduce: simplified data processing on large clusters. Commun ACM. 
2008;51(1):107\u201313.","journal-title":"Commun ACM"},{"key":"388_CR6","doi-asserted-by":"crossref","unstructured":"Wang G, Butt AR, Pandey P, Gupta K. Using realistic simulation for performance analysis of mapreduce setups. In: Proceedings of the 1st ACM workshop on large-scale system and application performance; 2009. p. 19\u201326.","DOI":"10.1145\/1552272.1552278"},{"key":"388_CR7","doi-asserted-by":"crossref","unstructured":"Samadi Y, Zbakh M, Tadonki C. Comparative study between hadoop and spark based on hibench benchmarks. In: 2016 2nd international conference on cloud computing technologies and applications (CloudTech). New York: IEEE; 2016. p. 267\u201375.","DOI":"10.1109\/CloudTech.2016.7847709"},{"issue":"1","key":"388_CR8","doi-asserted-by":"publisher","first-page":"20","DOI":"10.1186\/s40537-019-0185-4","volume":"6","author":"H Ahmadvand","year":"2019","unstructured":"Ahmadvand H, Goudarzi M, Foroutan F. Gapprox: using gallup approach for approximation in big data processing. J Big Data. 2019;6(1):20.","journal-title":"J Big Data"},{"issue":"12","key":"388_CR9","doi-asserted-by":"publisher","first-page":"4367","DOI":"10.1002\/cpe.4367","volume":"30","author":"Y Samadi","year":"2018","unstructured":"Samadi Y, Zbakh M, Tadonki C. Performance comparison between hadoop and spark frameworks using hibench benchmarks. Concurr Comput Pract Exp. 2018;30(12):4367.","journal-title":"Concurr Comput Pract Exp"},{"issue":"13","key":"388_CR10","doi-asserted-by":"publisher","first-page":"2110","DOI":"10.14778\/2831360.2831365","volume":"8","author":"J Shi","year":"2015","unstructured":"Shi J, Qiu Y, Minhas UF, Jiao L, Wang C, Reinwald B, \u00d6zcan F. Clash of the titans: mapreduce vs. spark for large scale data analytics. Proc VLDB Endow. 2015;8(13):2110\u2013211.","journal-title":"Proc VLDB Endow"},{"key":"388_CR11","doi-asserted-by":"crossref","unstructured":"Veiga J, Exp\u00f3sito RR, Pardo XC, Taboada GL, Tourifio J. 
Performance evaluation of big data frameworks for large-scale data analytics. In: 2016 ieee international conference on Big Data (Big Data). New York: IEEE; 2016. p. 424\u201331.","DOI":"10.1109\/BigData.2016.7840633"},{"key":"388_CR12","doi-asserted-by":"crossref","unstructured":"Li M, Tan J, Wang Y, Zhang L, Salapura V. Sparkbench: a comprehensive benchmarking suite for in memory data analytic platform spark. In: Proceedings of the 12th ACM international conference on computing frontiers; 2015. p. 1\u20138.","DOI":"10.1145\/2742854.2747283"},{"key":"388_CR13","doi-asserted-by":"crossref","unstructured":"Wang L, Zhan J, Luo C, Zhu Y, Yang Q, He Y, Gao W, Jia Z, Shi Y, Zhang S. Bigdatabench: a big data benchmark suite from internet services. In: 2014 IEEE 20th international symposium on high performance computer architecture (HPCA). New York: IEEE; 2014. p. 488\u201399.","DOI":"10.1109\/HPCA.2014.6835958"},{"key":"388_CR14","unstructured":"Thiruvathukal GK, Christensen C, Jin X, Tessier F, Vishwanath V. A benchmarking study to evaluate apache spark on large-scale supercomputers. 2019; arXiv preprint arXiv:1904.11812."},{"key":"388_CR15","doi-asserted-by":"crossref","unstructured":"Marcu O-C, Costan A, Antoniu G, P\u00e9rez-Hern\u00e1ndez MS. Spark versus flink: Understanding performance in big data analytics frameworks. In: 2016 IEEE international conference on cluster computing (CLUSTER). New York: IEEE; 2016. p. 433\u201342.","DOI":"10.1109\/CLUSTER.2016.22"},{"issue":"4","key":"388_CR16","doi-asserted-by":"publisher","first-page":"481","DOI":"10.1177\/1094342006070078","volume":"20","author":"R Bolze","year":"2006","unstructured":"Bolze R, Cappello F, Caron E, Dayd\u00e9 M, Desprez F, Jeannot E, J\u00e9gou Y, Lanteri S, Leduc J, Melab N, et al. Grid\u20195000: a large scale and highly reconfigurable experimental grid testbed. Int J High Perform Comput Appl. 
2006;20(4):481\u201394.","journal-title":"Int J High Perform Comput Appl"},{"key":"388_CR17","unstructured":"Mavridis I, Karatza E. Log file analysis in cloud with apache hadoop and apache spark 2015."},{"issue":"1","key":"388_CR18","first-page":"8","volume":"113","author":"S Gopalani","year":"2015","unstructured":"Gopalani S, Arora R. Comparing apache spark and map reduce with performance analysis using k-means. Int J Comput Appl. 2015;113(1):8\u201311.","journal-title":"Int J Comput Appl."},{"key":"388_CR19","doi-asserted-by":"crossref","unstructured":"Gu L, Li H. Memory or time: Performance evaluation for iterative operation on hadoop and spark. In: 2013 IEEE 10th international conference on high performance computing and communications & 2013 IEEE international conference on embedded and ubiquitous computing. New York: IEEE; 2013. p. 721\u20137.","DOI":"10.1109\/HPCC.and.EUC.2013.106"},{"key":"388_CR20","doi-asserted-by":"crossref","unstructured":"Lin X, Wang P, Wu B. Log analysis in cloud computing environment with hadoop and spark. In: 2013 5th IEEE international conference on broadband network & multimedia technology. New York: IEEE; 2013. p. 273\u20136.","DOI":"10.1109\/ICBNMT.2013.6823956"},{"key":"388_CR21","doi-asserted-by":"crossref","unstructured":"Petridis P, Gounaris A, Torres J. Spark parameter tuning via trial-and-error. In: INNS conference on big data. Berlin: Springer; 2016. p. 226\u201337.","DOI":"10.1007\/978-3-319-47898-2_24"},{"issue":"1","key":"388_CR22","doi-asserted-by":"publisher","first-page":"24","DOI":"10.1186\/s40537-015-0032-1","volume":"2","author":"S Landset","year":"2015","unstructured":"Landset S, Khoshgoftaar TM, Richter AN, Hasanin T. A survey of open source tools for machine learning with big data in the hadoop ecosystem. J Big Data. 2015;2(1):24.","journal-title":"J Big Data"},{"key":"388_CR23","unstructured":"HiBench Benchmark Suite. https:\/\/github.com\/intel-hadoop\/HiBench. 
Accessed 15 July 2020."},{"key":"388_CR24","doi-asserted-by":"crossref","unstructured":"Shvachko K, Kuang H, Radia S, Chansler R. The hadoop distributed file system. In: 2010 IEEE 26th symposium on mass storage systems and technologies (MSST). New York: IEEE; 2010. p. 1\u201310.","DOI":"10.1109\/MSST.2010.5496972"},{"key":"388_CR25","doi-asserted-by":"crossref","unstructured":"Luo M, Yokota H. Comparing hadoop and fat-btree based access method for small file i\/o applications. In: International conference on web-age information management. Berlin: Springer; 2010. p. 182\u201393.","DOI":"10.1007\/978-3-642-14246-8_20"},{"key":"388_CR26","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1471-2105-11-S12-S1","volume":"11","author":"RC Taylor","year":"2010","unstructured":"Taylor RC. An overview of the hadoop\/mapreduce\/hbase framework and its current applications in bioinformatics. BMC Bioinform. 2010;11:1.","journal-title":"BMC Bioinform"},{"key":"388_CR27","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4842-2199-0","volume-title":"Practical Hadoop ecosystem: a definitive guide to hadoop-related frameworks and tools","author":"D Vohra","year":"2016","unstructured":"Vohra D. Practical Hadoop ecosystem: a definitive guide to hadoop-related frameworks and tools. California: Apress; 2016."},{"issue":"4","key":"388_CR28","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1145\/2094114.2094118","volume":"40","author":"K-H Lee","year":"2012","unstructured":"Lee K-H, Lee Y-J, Choi H, Chung YD, Moon B. Parallel data processing with mapreduce: a survey. AcM sIGMoD record. 2012;40(4):11\u201320.","journal-title":"AcM sIGMoD record"},{"key":"388_CR29","first-page":"95","volume":"10","author":"M Zaharia","year":"2010","unstructured":"Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: cluster computing with working sets. HotCloud. 2010;10:95.","journal-title":"HotCloud"},{"key":"388_CR30","unstructured":"Kannan P. 
Beyond hadoop mapreduce apache tez and apache spark. San Jose State University); 2015. http:\/\/www.sjsu.edu\/people\/robert.chun\/courses\/CS259Fall2013\/s3\/F.pdf. Accessed 15 July 2020."},{"key":"388_CR31","unstructured":"Spark Core Programming. https:\/\/www.tutorialspoint.com\/apache_spark\/apache_spark_rdd.htm. Accessed 15 July 2020."},{"key":"388_CR32","doi-asserted-by":"crossref","unstructured":"Huang S, Huang J, Dai J, Xie T, Huang B. The hibench benchmark suite: Characterization of the mapreduce-based data analysis. In: 2010 IEEE 26th international conference on data engineering workshops (ICDEW 2010). New York: IEEE; 2010. p. 41\u201351.","DOI":"10.1109\/ICDEW.2010.5452747"},{"key":"388_CR33","doi-asserted-by":"crossref","unstructured":"Chen C-O, Zhuo Y-Q, Yeh C-C, Lin C-M, Liao S-W. Machine learning-based configuration parameter tuning on hadoop system. In: 2015 IEEE international congress on big data. New York: IEEE; 2015. p. 386\u201392.","DOI":"10.1109\/BigDataCongress.2015.64"},{"key":"388_CR34","unstructured":"Ambari. https:\/\/ambari.apache.org\/. Accessed 15 July 2020."},{"key":"388_CR35","doi-asserted-by":"crossref","unstructured":"Xiang L-H, Miao L, Zhang D-F, Chen F-P. Benefit of compression in hadoop: A case study of improving io performance on hadoop. In: Proceedings of the 6th international asia conference on industrial engineering and management innovation. Berlin: Springer; 2016. p. 879\u201390.","DOI":"10.2991\/978-94-6239-148-2_87"},{"key":"388_CR36","unstructured":"O\u2019Malley O. Terabyte sort on apache hadoop. Report, Yahoo!; 2008. http:\/\/sortbenchmark.org\/YahooHadoop.pdf. Accessed 15 July 2020."},{"key":"388_CR37","unstructured":"Apache Tuning Spark 1.1.1. https:\/\/spark.apache.org\/docs\/1.1.1\/tuning.html. 
Accessed 15 July 2020."},{"issue":"3","key":"388_CR38","doi-asserted-by":"publisher","first-page":"630","DOI":"10.1007\/s10766-017-0513-2","volume":"46","author":"MM Rathore","year":"2018","unstructured":"Rathore MM, Son H, Ahmad A, Paul A, Jeon G. Real-time big data stream processing using gpu with spark over hadoop ecosystem. Int J Parallel Progr. 2018;46(3):630\u201346.","journal-title":"Int J Parallel Progr"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-020-00388-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/s40537-020-00388-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-020-00388-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,12,14]],"date-time":"2020-12-14T07:59:46Z","timestamp":1607932786000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-020-00388-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,12]]},"references-count":38,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,12]]}},"alternative-id":["388"],"URL":"https:\/\/doi.org\/10.1186\/s40537-020-00388-5","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-43526\/v1","asserted-by":"object"},{"id-type":"doi","id":"10.21203\/rs.3.rs-43526\/v2","asserted-by":"object"}]},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,12]]},"assertion":[{"value":"30 July 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 
November 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 December 2020","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"Not applicable.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"110"}}