{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,30]],"date-time":"2026-05-30T02:09:22Z","timestamp":1780106962903,"version":"3.54.0"},"reference-count":26,"publisher":"Association for Computing Machinery (ACM)","issue":"10","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2013,8,26]]},"abstract":"<jats:p>We analyze Hadoop workloads from three different research clusters from a user-centric perspective. The goal is to better understand data scientists' use of the system and how well the use of the system matches its design. Our analysis suggests that Hadoop usage is still in its adolescence. We see underuse of Hadoop features, extensions, and tools. We see significant diversity in resource usage and application styles, including some interactive and iterative workloads, motivating new tools in the ecosystem. We also observe significant opportunities for optimizations of these workloads. We find that job customization and configuration are used in a narrow scope, suggesting the future pursuit of automatic tuning systems. 
Overall, we present the first user-centered measurement study of Hadoop and find significant opportunities for improving its efficient use for data scientists.<\/jats:p>","DOI":"10.14778\/2536206.2536213","type":"journal-article","created":{"date-parts":[[2014,6,24]],"date-time":"2014-06-24T12:17:57Z","timestamp":1403612277000},"page":"853-864","source":"Crossref","is-referenced-by-count":77,"title":["Hadoop's adolescence"],"prefix":"10.14778","volume":"6","author":[{"given":"Kai","family":"Ren","sequence":"first","affiliation":[{"name":"Carnegie Mellon University"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"YongChul","family":"Kwon","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Magdalena","family":"Balazinska","sequence":"additional","affiliation":[{"name":"University of Washington"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Bill","family":"Howe","sequence":"additional","affiliation":[{"name":"University of Washington"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2013,8]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Yahoo! reaches for the stars with M45 supercomputing project. http:\/\/research.yahoo.com\/node\/1884.  Yahoo! reaches for the stars with M45 supercomputing project. http:\/\/research.yahoo.com\/node\/1884."},{"key":"e_1_2_1_2_1","volume-title":"OSDI","author":"Ananthanarayanan G.","year":"2010","unstructured":"G. Ananthanarayanan Reining in the outliers in Map-Reduce clusters using Mantri . In OSDI , 2010 . G. Ananthanarayanan et al. Reining in the outliers in Map-Reduce clusters using Mantri. In OSDI, 2010."},{"key":"e_1_2_1_3_1","volume-title":"NSDI","author":"Ananthanarayanan G.","year":"2012","unstructured":"G. Ananthanarayanan : Coordinated memory caching for parallel jobs . In NSDI , 2012 . G. Ananthanarayanan et al. 
PACMan: Coordinated memory caching for parallel jobs. In NSDI, 2012."},{"key":"e_1_2_1_4_1","unstructured":"Apache Foundation. Hadoop. http:\/\/hadoop.apache.org\/.  Apache Foundation. Hadoop. http:\/\/hadoop.apache.org\/."},{"key":"e_1_2_1_5_1","unstructured":"Apache Foundation. Mahout: Scalable machine learning and data mining. http:\/\/mahout.apache.org\/.  Apache Foundation. Mahout: Scalable machine learning and data mining. http:\/\/mahout.apache.org\/."},{"key":"e_1_2_1_6_1","first-page":"996","volume-title":"ICDE","year":"2010","unstructured":"Ashish Thusoo et. al. Hive : a petabyte scale data warehouse using Hadoop . In ICDE , pages 996 - 1005 , 2010 . Ashish Thusoo et. al. Hive: a petabyte scale data warehouse using Hadoop. In ICDE, pages 996-1005, 2010."},{"key":"e_1_2_1_7_1","first-page":"137","volume-title":"SoCC","author":"Babu S.","year":"2010","unstructured":"S. Babu . Towards automatic optimization of mapreduce programs . In SoCC , pages 137 - 142 , 2010 . S. Babu. Towards automatic optimization of mapreduce programs. In SoCC, pages 137-142, 2010."},{"key":"e_1_2_1_8_1","unstructured":"D. Borthakur. The Hadoop distributed file system: Architecture and design. http:\/\/lucene.apache.org\/hadoop\/hdfs_design.pdf 2007.  D. Borthakur. The Hadoop distributed file system: Architecture and design. http:\/\/lucene.apache.org\/hadoop\/hdfs_design.pdf 2007."},{"key":"e_1_2_1_9_1","first-page":"390","volume-title":"MASCOTS","author":"Chen Y.","unstructured":"Y. Chen The case for evaluating MapReduce performance using workload suites . In MASCOTS , pages 390 - 399 . Y. Chen et al. The case for evaluating MapReduce performance using workload suites. In MASCOTS, pages 390-399."},{"issue":"12","key":"e_1_2_1_10_1","first-page":"1802","article-title":"Interactive query processing in big data systems: A cross-industry study of MapReduce workloads","volume":"5","author":"Chen Y.","year":"2012","unstructured":"Y. 
Chen Interactive query processing in big data systems: A cross-industry study of MapReduce workloads . PVLDB , 5 ( 12 ): 1802 - 1813 , 2012 . Y. Chen et al. Interactive query processing in big data systems: A cross-industry study of MapReduce workloads. PVLDB, 5(12):1802-1813, 2012.","journal-title":"PVLDB"},{"key":"e_1_2_1_11_1","unstructured":"Concurrent Inc. Cascading. http:\/\/www.cascading.org\/ 2012.  Concurrent Inc. Cascading. http:\/\/www.cascading.org\/ 2012."},{"key":"e_1_2_1_12_1","volume-title":"OSDI","author":"Dean J.","year":"2004","unstructured":"J. Dean and S. Ghemawat . MapReduce: Simplified data processing on large clusters . In OSDI , 2004 . J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004."},{"issue":"11","key":"e_1_2_1_14_1","first-page":"1111","article-title":"Profiling, what-if analysis, and cost-based optimization of MapReduce programs","volume":"4","author":"Herodotou H.","year":"2011","unstructured":"H. Herodotou and S. Babu . Profiling, what-if analysis, and cost-based optimization of MapReduce programs . PVLDB , 4 ( 11 ): 1111 - 1122 , 2011 . H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of MapReduce programs. PVLDB, 4(11):1111-1122, 2011.","journal-title":"PVLDB"},{"key":"e_1_2_1_15_1","first-page":"229","volume-title":"ICDM","author":"Kang U.","year":"2009","unstructured":"U. Kang : A peta-scale graph mining system implementation and observations . In ICDM , pages 229 - 238 , 2009 . U. Kang et al. PEGASUS: A peta-scale graph mining system implementation and observations. In ICDM, pages 229-238, 2009."},{"key":"e_1_2_1_16_1","first-page":"94","volume-title":"CCGRID","author":"Kavulya S.","year":"2010","unstructured":"S. Kavulya An analysis of traces from a production MapReduce cluster . In CCGRID , pages 94 - 103 , 2010 . S. Kavulya et al. An analysis of traces from a production MapReduce cluster. 
In CCGRID, pages 94-103, 2010."},{"key":"e_1_2_1_17_1","volume-title":"HotOS","author":"Ke Q.","year":"2011","unstructured":"Q. Ke Optimizing data partitioning for data-parallel computing . In HotOS , 2011 . Q. Ke et al. Optimizing data partitioning for data-parallel computing. In HotOS, 2011."},{"issue":"7","key":"e_1_2_1_18_1","first-page":"598","article-title":"Perfxplain: Debugging mapreduce job performance","volume":"5","author":"Khoussainova N.","year":"2012","unstructured":"N. Khoussainova Perfxplain: Debugging mapreduce job performance . PVLDB , 5 ( 7 ): 598 - 609 , 2012 . N. Khoussainova et al. Perfxplain: Debugging mapreduce job performance. PVLDB, 5(7):598-609, 2012.","journal-title":"PVLDB"},{"key":"e_1_2_1_19_1","first-page":"25","volume-title":"SIGMOD","author":"Kwon Y.","year":"2012","unstructured":"Y. Kwon : mitigating skew in mapreduce applications . In SIGMOD , pages 25 - 36 , 2012 . Y. Kwon et al. SkewTune: mitigating skew in mapreduce applications. In SIGMOD, pages 25-36, 2012."},{"issue":"2","key":"e_1_2_1_20_1","doi-asserted-by":"crossref","first-page":"129","DOI":"10.1109\/TIT.1982.1056489","article-title":"Least squares quantization","volume":"28","author":"Lloyd S. P.","year":"1982","unstructured":"S. P. Lloyd . Least squares quantization in PCM. IEEE Transactions on Information Theory , 28 ( 2 ): 129 - 137 , 1982 . S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982.","journal-title":"PCM. IEEE Transactions on Information Theory"},{"key":"e_1_2_1_21_1","first-page":"135","volume-title":"SIGMOD","author":"Malewicz G.","year":"2010","unstructured":"G. Malewicz : a system for large-scale graph processing . In SIGMOD , pages 135 - 146 , 2010 . G. Malewicz et al. Pregel: a system for large-scale graph processing. In SIGMOD, pages 135-146, 2010."},{"key":"e_1_2_1_22_1","first-page":"1099","volume-title":"SIGMOD","author":"Olston C.","year":"2008","unstructured":"C. 
Olston : a not-so-foreign language for data processing . In SIGMOD , pages 1099 - 1110 , 2008 . C. Olston et al. Pig Latin: a not-so-foreign language for data processing. In SIGMOD, pages 1099-1110, 2008."},{"key":"e_1_2_1_23_1","first-page":"245","volume-title":"SIGMOD","author":"Olston C.","year":"2009","unstructured":"C. Olston Generating example data for dataflow programs . In SIGMOD , pages 245 - 256 , 2009 . C. Olston et al. Generating example data for dataflow programs. In SIGMOD, pages 245-256, 2009."},{"key":"e_1_2_1_24_1","first-page":"165","volume-title":"SIGMOD","author":"Pavlo A.","year":"2009","unstructured":"A. Pavlo A comparison of approaches to large-scale data analysis . In SIGMOD , pages 165 - 178 , 2009 . A. Pavlo et al. A comparison of approaches to large-scale data analysis. In SIGMOD, pages 165-178, 2009."},{"key":"e_1_2_1_25_1","volume-title":"A Scalar productivity framework for Hadoop. https:\/\/github.com\/NICTA\/scoobi","year":"2012","unstructured":"Scoobi Team. A Scalar productivity framework for Hadoop. https:\/\/github.com\/NICTA\/scoobi , 2012 . Scoobi Team. A Scalar productivity framework for Hadoop. https:\/\/github.com\/NICTA\/scoobi, 2012."},{"key":"e_1_2_1_26_1","first-page":"1","volume-title":"SoCC","author":"Sharma B.","year":"2011","unstructured":"B. Sharma Modeling and synthesizing task placement constraints in Google compute clusters . In SoCC , pages 3: 1 - 3 :14, 2011 . B. Sharma et al. Modeling and synthesizing task placement constraints in Google compute clusters. In SoCC, pages 3:1-3:14, 2011."},{"key":"e_1_2_1_27_1","doi-asserted-by":"crossref","first-page":"420","DOI":"10.1145\/2247596.2247646","volume-title":"EDBT","author":"Vernica R.","year":"2012","unstructured":"R. Vernica Adaptive MapReduce using situation-aware mappers . In EDBT , pages 420 - 431 , 2012 . R. Vernica et al. Adaptive MapReduce using situation-aware mappers. 
In EDBT, pages 420-431, 2012."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/2536206.2536213","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:04:32Z","timestamp":1672221872000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/2536206.2536213"}},"subtitle":["an analysis of Hadoop usage in scientific workloads"],"short-title":[],"issued":{"date-parts":[[2013,8]]},"references-count":26,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2013,8,26]]}},"alternative-id":["10.14778\/2536206.2536213"],"URL":"https:\/\/doi.org\/10.14778\/2536206.2536213","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2013,8]]}}}