{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,24]],"date-time":"2026-06-24T16:00:36Z","timestamp":1782316836527,"version":"3.54.5"},"reference-count":19,"publisher":"Association for Computing Machinery (ACM)","issue":"13","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2015,9]]},"abstract":"<jats:p>MapReduce and Spark are two very popular open source cluster computing frameworks for large scale data analytics. These frameworks hide the complexity of task parallelism and fault-tolerance, by exposing a simple programming API to users. In this paper, we evaluate the major architectural components in MapReduce and Spark frameworks including: shuffle, execution model, and caching, by using a set of important analytic workloads. To conduct a detailed analysis, we developed two profiling tools: (1) We correlate the task execution plan with the resource utilization for both MapReduce and Spark, and visually present this correlation; (2) We provide a break-down of the task execution time for in-depth analysis. Through detailed experiments, we quantify the performance differences between MapReduce and Spark. Furthermore, we attribute these performance differences to different components which are architected differently in the two frameworks. We further expose the source of these performance differences by using a set of micro-benchmark experiments. Overall, our experiments show that Spark is about 2.5x, 5x, and 5x faster than MapReduce, for Word Count, k-means, and PageRank, respectively. The main causes of these speedups are the efficiency of the hash-based aggregation component for combine, as well as reduced CPU and disk overheads due to RDD caching in Spark. An exception to this is the Sort workload, for which MapReduce is 2x faster than Spark. 
We show that MapReduce's execution model is more efficient for shuffling data than Spark, thus making Sort run faster on MapReduce.<\/jats:p>","DOI":"10.14778\/2831360.2831365","type":"journal-article","created":{"date-parts":[[2015,9,30]],"date-time":"2015-09-30T12:16:36Z","timestamp":1443615396000},"page":"2110-2121","source":"Crossref","is-referenced-by-count":163,"title":["Clash of the titans"],"prefix":"10.14778","volume":"8","author":[{"given":"Juwei","family":"Shi","sequence":"first","affiliation":[{"name":"Renmin University of China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yunjie","family":"Qiu","sequence":"additional","affiliation":[{"name":"IBM Research, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Umar Farooq","family":"Minhas","sequence":"additional","affiliation":[{"name":"IBM Almaden Research Center"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Limei","family":"Jiao","sequence":"additional","affiliation":[{"name":"IBM Research, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chen","family":"Wang","sequence":"additional","affiliation":[{"name":"Tsinghua University"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Berthold","family":"Reinwald","sequence":"additional","affiliation":[{"name":"IBM Almaden Research Center"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Fatma","family":"\u00d6zcan","sequence":"additional","affiliation":[{"name":"IBM Almaden Research Center"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2015,9]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Apache Hadoop. http:\/\/hadoop.apache.org\/."},{"key":"e_1_2_1_2_1","unstructured":"Apache Mahout. https:\/\/mahout.apache.org\/."},{"key":"e_1_2_1_3_1","unstructured":"HDFS caching. 
http:\/\/hadoop.apache.org\/docs\/current\/hadoop-project-dist\/hadoop-hdfs\/CentralizedCacheManagement.html."},{"key":"e_1_2_1_4_1","unstructured":"HPROF: A heap\/cpu profiling tool. http:\/\/docs.oracle.com\/javase\/7\/docs\/technotes\/samples\/hprof.html."},{"key":"e_1_2_1_5_1","unstructured":"RRDtool. http:\/\/oss.oetiker.ch\/rrdtool\/."},{"key":"e_1_2_1_6_1","unstructured":"Spark wins 2014 graysort competition. http:\/\/databricks.com\/blog\/2014\/11\/05\/spark-officially-sets-a-new-record-in-large-scale-sorting.html."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1327452.1327492"},{"key":"e_1_2_1_8_1","volume-title":"Functional Programming: Application and Implementation","author":"Henderson P.","year":"1980","unstructured":"P. Henderson. Functional Programming: Application and Implementation. Prentice-Hall International London, 1980."},{"issue":"11","key":"e_1_2_1_9_1","first-page":"1111","article-title":"Profiling, what-if analysis, and cost-based optimization of mapreduce programs","volume":"4","author":"Herodotou H.","year":"2011","unstructured":"H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. 
VLDB, 4(11):1111--1122, 2011.","journal-title":"VLDB"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDEW.2010.5452747"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2670979.2670985"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807184"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2004.04.001"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1298306.1298311"},{"key":"e_1_2_1_15_1","volume-title":"Sort Benchmark","author":"Malley O.","year":"2009","unstructured":"O. O'Malley and A. C. Murthy. Winning a 60 second dash with a yellow elephant. Sort Benchmark, 2009."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.14778\/2733004.2733005"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2523616.2523633"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/SCC.2010.41"},{"key":"e_1_2_1_19_1","volume-title":"NSDI","author":"Zaharia M.","year":"2012","unstructured":"M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. 
In NSDI, 2012."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/2831360.2831365","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T09:28:32Z","timestamp":1672219712000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/2831360.2831365"}},"subtitle":["MapReduce vs. Spark for large scale data analytics"],"short-title":[],"issued":{"date-parts":[[2015,9]]},"references-count":19,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2015,9]]}},"alternative-id":["10.14778\/2831360.2831365"],"URL":"https:\/\/doi.org\/10.14778\/2831360.2831365","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2015,9]]}}}