{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,27]],"date-time":"2025-11-27T10:39:10Z","timestamp":1764239950604},"reference-count":16,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2015,8]]},"abstract":"<jats:p>Apache Spark is one of the most widely used open source processing engines for big data, with rich language-integrated APIs and a wide range of libraries. Over the past two years, our group has worked to deploy Spark to a wide range of organizations through consulting relationships as well as our hosted service, Databricks. We describe the main challenges and requirements that appeared in taking Spark to a wide set of users, and usability and performance improvements we have made to the engine in response.<\/jats:p>","DOI":"10.14778\/2824032.2824080","type":"journal-article","created":{"date-parts":[[2015,9,16]],"date-time":"2015-09-16T12:18:17Z","timestamp":1442405897000},"page":"1840-1843","source":"Crossref","is-referenced-by-count":81,"title":["Scaling spark in the real world"],"prefix":"10.14778","volume":"8","author":[{"given":"Michael","family":"Armbrust","sequence":"first","affiliation":[{"name":"Databricks Inc."}]},{"given":"Tathagata","family":"Das","sequence":"additional","affiliation":[{"name":"Databricks Inc."}]},{"given":"Aaron","family":"Davidson","sequence":"additional","affiliation":[{"name":"Databricks Inc."}]},{"given":"Ali","family":"Ghodsi","sequence":"additional","affiliation":[{"name":"Databricks Inc."}]},{"given":"Andrew","family":"Or","sequence":"additional","affiliation":[{"name":"Databricks Inc."}]},{"given":"Josh","family":"Rosen","sequence":"additional","affiliation":[{"name":"Databricks Inc."}]},{"given":"Ion","family":"Stoica","sequence":"additional","affiliation":[{"name":"Databricks Inc."}]},{"given":"Patrick","family":"Wendell","sequence":"additional","affiliation":[{"name":"Databricks Inc."}]},{"given":"Reynold","family":"Xin","sequence":"additional","affiliation":[{"name":"Databricks Inc."}]},{"given":"Matei","family":"Zaharia","sequence":"additional","affiliation":[{"name":"MIT CSAIL"}]}],"member":"320","published-online":{"date-parts":[[2015,8]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Apache Spark project. http:\/\/spark.apache.org.  Apache Spark project. http:\/\/spark.apache.org."},{"key":"e_1_2_1_2_1","volume-title":"SIGMOD","author":"Armbrust M.","year":"2015","unstructured":"M. Armbrust : relational data processing in Spark . In SIGMOD , 2015 . 10.1145\/2723372.2742797 M. Armbrust et al. Spark SQL: relational data processing in Spark. In SIGMOD, 2015. 10.1145\/2723372.2742797"},{"key":"e_1_2_1_3_1","volume-title":"OSDI","author":"Dean J.","year":"2004","unstructured":"J. Dean and S. Ghemawat . MapReduce: Simplified data processing on large clusters . In OSDI , 2004 . J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004."},{"issue":"9","key":"e_1_2_1_4_1","doi-asserted-by":"crossref","first-page":"941","DOI":"10.1038\/nmeth.3041","article-title":"Mapping brain activity at scale with cluster computing","volume":"11","author":"Freeman J.","year":"2014","unstructured":"J. Freeman , N. Vladimirov , T. Kawashima , Y. Mu , N. J. Sofroniew , D. V. Bennett , J. Rosen , C.-T. Yang , L. L. Looger , and M. B. Ahrens . Mapping brain activity at scale with cluster computing . Nature Methods , 11 ( 9 ): 941 -- 950 , Sept 2014 . J. Freeman, N. Vladimirov, T. Kawashima, Y. Mu, N. J. Sofroniew, D. V. Bennett, J. Rosen, C.-T. Yang, L. L. Looger, and M. B. Ahrens. Mapping brain activity at scale with cluster computing. Nature Methods, 11(9):941--950, Sept 2014.","journal-title":"Nature Methods"},{"key":"e_1_2_1_5_1","volume-title":"OSDI","author":"Gonzalez J. E.","year":"2014","unstructured":"J. E. Gonzalez : Graph processing in a distributed dataflow framework . In OSDI , 2014 . J. E. Gonzalez et al. GraphX: Graph processing in a distributed dataflow framework. In OSDI, 2014."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1272996.1273005"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807184"},{"key":"e_1_2_1_8_1","unstructured":"X. Meng etal ML pipelines: a new high-level API for MLlib. http:\/\/tinyurl.com\/spark-ml.  X. Meng et al. ML pipelines: a new high-level API for MLlib. http:\/\/tinyurl.com\/spark-ml."},{"key":"e_1_2_1_9_1","volume-title":"SIGMOD","author":"Nothaft F. A.","year":"2015","unstructured":"F. A. Nothaft , M. Massie , T. Danford , Z. Zhang , U. Laserson , C. Yeksigian , J. Kottalam , A. Ahuja , J. Hammerbacher , M. Linderman , M. J. Franklin , A. D. Joseph , and D. A. Patterson . Rethinking data-intensive science using scalable analytics systems . In SIGMOD , 2015 . 10.1145\/2723372.2742787 F. A. Nothaft, M. Massie, T. Danford, Z. Zhang, U. Laserson, C. Yeksigian, J. Kottalam, A. Ahuja, J. Hammerbacher, M. Linderman, M. J. Franklin, A. D. Joseph, and D. A. Patterson. Rethinking data-intensive science using scalable analytics systems. In SIGMOD, 2015. 10.1145\/2723372.2742787"},{"key":"e_1_2_1_10_1","unstructured":"Project Tungsten. https:\/\/databricks.com\/blog\/2015\/04\/28\/.  Project Tungsten. https:\/\/databricks.com\/blog\/2015\/04\/28\/."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544927"},{"key":"e_1_2_1_12_1","unstructured":"R. Xin etal GraySort on Apache Spark by Databricks. http:\/\/sortbenchmark.org\/ApacheSpark2014.pdf.  R. Xin et al. GraySort on Apache Spark by Databricks. http:\/\/sortbenchmark.org\/ApacheSpark2014.pdf."},{"key":"e_1_2_1_13_1","unstructured":"R. Xin and M. Zaharia. Lessons from running large scale Spark workloads. http:\/\/tinyurl.com\/large-scale-spark.  R. Xin and M. Zaharia. Lessons from running large scale Spark workloads. http:\/\/tinyurl.com\/large-scale-spark."},{"key":"e_1_2_1_14_1","volume-title":"NSDI","author":"Zaharia M.","year":"2012","unstructured":"M. Zaharia Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing . In NSDI , 2012 . M. Zaharia et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012."},{"key":"e_1_2_1_15_1","volume-title":"SOSP","author":"Zaharia M.","year":"2013","unstructured":"M. Zaharia : Fault-tolerant streaming computation at scale . In SOSP , 2013 . 10.1145\/2517349.2522737 M. Zaharia et al. Discretized streams: Fault-tolerant streaming computation at scale. In SOSP, 2013. 10.1145\/2517349.2522737"},{"key":"e_1_2_1_16_1","volume-title":"SIGMOD","author":"Zeng K.","year":"2015","unstructured":"K. Zeng : Generalized online aggregation for interactive analysis on big data . In SIGMOD , 2015 . 10.1145\/2723372.2735381 K. Zeng et al. G-OLA: Generalized online aggregation for interactive analysis on big data. In SIGMOD, 2015. 10.1145\/2723372.2735381"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/2824032.2824080","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:05:23Z","timestamp":1672221923000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/2824032.2824080"}},"subtitle":["performance and usability"],"short-title":[],"issued":{"date-parts":[[2015,8]]},"references-count":16,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2015,8]]}},"alternative-id":["10.14778\/2824032.2824080"],"URL":"https:\/\/doi.org\/10.14778\/2824032.2824080","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2015,8]]}}}