{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,8,15]],"date-time":"2023-08-15T15:02:03Z","timestamp":1692111723241},"reference-count":47,"publisher":"Association for Computing Machinery (ACM)","issue":"5","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2020,1]]},"abstract":"<jats:p>Big-data systems have gained significant momentum, and Apache Spark is becoming a de facto standard for modern data analytics. Spark relies on SQL query compilation to optimize the execution performance of analytical workloads on a variety of data sources. Despite its scalable architecture, Spark's SQL code generation suffers from significant runtime overheads related to data access and deserialization. This performance penalty can be significant, especially when applications operate on human-readable data formats such as CSV or JSON.<\/jats:p>\n          <jats:p>In this paper, we present a new approach to query compilation that overcomes these limitations by relying on runtime profiling and dynamic code generation. 
Our new SQL compiler for Spark produces highly-efficient machine code, leading to speedups of up to 4.4x on the TPC-H benchmark with textual-form data formats such as CSV or JSON.<\/jats:p>","DOI":"10.14778\/3377369.3377382","type":"journal-article","created":{"date-parts":[[2020,2,19]],"date-time":"2020-02-19T18:58:53Z","timestamp":1582138733000},"page":"754-767","source":"Crossref","is-referenced-by-count":7,"title":["Dynamic speculative optimizations for SQL compilation in Apache Spark"],"prefix":"10.14778","volume":"13","author":[{"given":"Filippo","family":"Schiavio","sequence":"first","affiliation":[{"name":"Universit\u00e0 della Svizzera italiana (USI), Switzerland"}]},{"given":"Daniele","family":"Bonetta","sequence":"additional","affiliation":[{"name":"VM Research Group Oracle Labs"}]},{"given":"Walter","family":"Binder","sequence":"additional","affiliation":[{"name":"Universit\u00e0 della Svizzera italiana (USI), Switzerland"}]}],"member":"320","published-online":{"date-parts":[[2020,2,19]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824080"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2742797"},{"issue":"12","key":"e_1_2_1_3_1","first-page":"1778","article-title":"FAD.js","volume":"10","author":"Bonetta D.","year":"2017","unstructured":"D. Bonetta and M. Brantner . FAD.js : Fast JSON Data Access Using JIT-based Speculative Optimizations. PVLDB , 10 ( 12 ): 1778 -- 1789 , 2017 . D. Bonetta and M. Brantner. FAD.js: Fast JSON Data Access Using JIT-based Speculative Optimizations. PVLDB, 10(12):1778--1789, 2017.","journal-title":"Fast JSON Data Access Using JIT-based Speculative Optimizations. PVLDB"},{"key":"e_1_2_1_4_1","first-page":"28","volume-title":"IEEE Data Eng. Bull.","author":"Carbone P.","year":"2015","unstructured":"P. Carbone , A. Katsifodimos , S. Ewen , V. Markl , S. Haridi , and K. Tzoumas . Apache Flink: Stream and Batch Processing in a Single Engine . 
IEEE Data Eng. Bull. , pages 28 -- 38 , 2015 . P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache Flink: Stream and Batch Processing in a Single Engine. IEEE Data Eng. Bull., pages 28--38, 2015."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.14778\/3007263.3007277"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2016.7482079"},{"key":"e_1_2_1_7_1","volume-title":"Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop","year":"2019","unstructured":"Databricks. Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop , 2019 . https:\/\/databricks.com\/blog\/2016\/05\/23\/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html. Databricks. Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop, 2019. https:\/\/databricks.com\/blog\/2016\/05\/23\/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html."},{"key":"e_1_2_1_8_1","volume-title":"Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale","year":"2019","unstructured":"Databricks. Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale , 2019 . https:\/\/databricks.com\/blog\/2015\/04\/13\/deep-dive-into-spark-sqls-catalyst-optimizer.html. Databricks. Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale, 2019. https:\/\/databricks.com\/blog\/2015\/04\/13\/deep-dive-into-spark-sqls-catalyst-optimizer.html."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2542142.2542143"},{"key":"e_1_2_1_10_1","first-page":"799","volume-title":"OSDI","author":"Essertel G. M.","year":"2018","unstructured":"G. M. Essertel , R. Y. Tahboub , J. M. Decker , K. J. Brown , K. Olukotun , and T. Rompf . Flare: Optimizing Apache Spark with Native Compilation for Scale-up Architectures and Medium-size Data . In OSDI , pages 799 -- 815 , 2018 . G. M. Essertel, R. Y. Tahboub, J. M. Decker, K. 
J. Brown, K. Olukotun, and T. Rompf. Flare: Optimizing Apache Spark with Native Compilation for Scale-up Architectures and Medium-size Data. In OSDI, pages 799--815, 2018."},{"key":"e_1_2_1_11_1","volume-title":"Welcome - Data Center Observatory --- ETH Zurich","author":"ETH","year":"2019","unstructured":"ETH DCO. Welcome - Data Center Observatory --- ETH Zurich , 2019 . https:\/\/wiki.dco.ethz.ch\/. ETH DCO. Welcome - Data Center Observatory --- ETH Zurich, 2019. https:\/\/wiki.dco.ethz.ch\/."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CYBER.2015.7288049"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.5555\/645841.670891"},{"key":"e_1_2_1_14_1","volume-title":"Performance comparison between Java and JNI for optimal implementation of computational micro-kernels. CoRR, abs\/1412.6765","author":"Halli N. A.","year":"2014","unstructured":"N. A. Halli , H.-P. Charles , and J.-F. M\u00e9haut . Performance comparison between Java and JNI for optimal implementation of computational micro-kernels. CoRR, abs\/1412.6765 , 2014 . N. A. Halli, H.-P. Charles, and J.-F. M\u00e9haut. Performance comparison between Java and JNI for optimal implementation of computational micro-kernels. CoRR, abs\/1412.6765, 2014."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2610531"},{"key":"e_1_2_1_16_1","volume-title":"CIDR","author":"Karpathiotakis M.","year":"2015","unstructured":"M. Karpathiotakis , I. Alagiannis , T. Heinis , M. Branco , and A. Ailamaki . Just-In-Time Data Virtualization: Lightweight Data Management with ViDa . CIDR , 2015 . M. Karpathiotakis, I. Alagiannis, T. Heinis, M. Branco, and A. Ailamaki. Just-In-Time Data Virtualization: Lightweight Data Management with ViDa. CIDR, 2015."},{"key":"e_1_2_1_17_1","volume-title":"CoRR","author":"Kashuba A.","year":"2018","unstructured":"A. Kashuba and H. M\u00fchleisen . Automatic Generation of a Hybrid Query Execution Engine . CoRR , 2018 . A. Kashuba and H. 
M\u00fchleisen. Automatic Generation of a Hybrid Query Execution Engine. CoRR, 2018."},{"issue":"13","key":"e_1_2_1_18_1","first-page":"2209","article-title":"Everything You Always Wanted to Know About Compiled and Vectorized Queries but Were Afraid to Ask","volume":"11","author":"Kersten T.","year":"2018","unstructured":"T. Kersten , V. Leis , A. Kemper , T. Neumann , A. Pavlo , and P. Boncz . Everything You Always Wanted to Know About Compiled and Vectorized Queries but Were Afraid to Ask . PVLDB , 11 ( 13 ): 2209 -- 2222 , 2018 . T. Kersten, V. Leis, A. Kemper, T. Neumann, A. Pavlo, and P. Boncz. Everything You Always Wanted to Know About Compiled and Vectorized Queries but Were Afraid to Ask. PVLDB, 11(13):2209--2222, 2018.","journal-title":"PVLDB"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2018.00027"},{"issue":"10","key":"e_1_2_1_20_1","first-page":"1118","article-title":"Mison","volume":"10","author":"Li Y.","year":"2017","unstructured":"Y. Li , N. R. Katsipoulakis , B. Chandramouli , J. Goldstein , and D. Kossmann . Mison : A Fast JSON Parser for Data Analytics. PVLDB , 10 ( 10 ): 1118 -- 1129 , 2017 . Y. Li, N. R. Katsipoulakis, B. Chandramouli, J. Goldstein, and D. Kossmann. Mison: A Fast JSON Parser for Data Analytics. PVLDB, 10(10):1118--1129, 2017.","journal-title":"A Fast JSON Parser for Data Analytics. PVLDB"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3267809.3267814"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.14778\/3151113.3151114"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.14778\/2002938.2002940"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173162.3173200"},{"key":"e_1_2_1_25_1","volume-title":"About Java Flight Recorder","year":"2019","unstructured":"Oracle. About Java Flight Recorder , 2019 . https:\/\/docs.oracle.com\/javacomponents\/jmc-5-4\/jfr-runtime-guide\/about.htm#JFRUH170. Oracle. About Java Flight Recorder, 2019. 
https:\/\/docs.oracle.com\/javacomponents\/jmc-5-4\/jfr-runtime-guide\/about.htm#JFRUH170."},{"key":"e_1_2_1_26_1","volume-title":"com.oracle.truffle.api (GraalVM Truffle Java API Reference)","year":"2019","unstructured":"Oracle. com.oracle.truffle.api (GraalVM Truffle Java API Reference) , 2019 . https:\/\/www.graalvm.org\/truffle\/javadoc\/com\/oracle\/truffle\/api\/package-summary.html. Oracle. com.oracle.truffle.api (GraalVM Truffle Java API Reference), 2019. https:\/\/www.graalvm.org\/truffle\/javadoc\/com\/oracle\/truffle\/api\/package-summary.html."},{"key":"e_1_2_1_27_1","volume-title":"Java Native Interface Specification Contents","year":"2019","unstructured":"Oracle. Java Native Interface Specification Contents , 2019 . https:\/\/docs.oracle.com\/javase\/8\/docs\/technotes\/guides\/jni\/spec\/jniTOC.html. Oracle. Java Native Interface Specification Contents, 2019. https:\/\/docs.oracle.com\/javase\/8\/docs\/technotes\/guides\/jni\/spec\/jniTOC.html."},{"key":"e_1_2_1_28_1","volume-title":"Database --- Cloud Database --- Oracle","author":"Oracle","year":"2019","unstructured":"Oracle RDBMS. Database --- Cloud Database --- Oracle , 2019 . https:\/\/www.oracle.com\/it\/database\/. Oracle RDBMS. Database --- Cloud Database --- Oracle, 2019. https:\/\/www.oracle.com\/it\/database\/."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.14778\/3236187.3236207"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1868294.1868314"},{"key":"e_1_2_1_31_1","first-page":"2","volume-title":"SIMD Intrinsics on Managed Language Runtimes. In CGO 2018","author":"Stojanov A.","year":"2018","unstructured":"A. Stojanov , I. Toskov , T. Rompf , and M. Puschel . SIMD Intrinsics on Managed Language Runtimes. In CGO 2018 , pages 2 -- 15 , 2018 . A. Stojanov, I. Toskov, T. Rompf, and M. Puschel. SIMD Intrinsics on Managed Language Runtimes. In CGO 2018, pages 2--15, 2018."},{"key":"e_1_2_1_32_1","unstructured":"Team Apache Hadoop. Apache Hadoop 2019. 
https:\/\/hadoop.apache.org\/.  Team Apache Hadoop. Apache Hadoop 2019. https:\/\/hadoop.apache.org\/."},{"key":"e_1_2_1_33_1","unstructured":"Team Apache Hadoop. Apache Hadoop 2.9.2; The YARN Timeline Service v.2 2019. http:\/\/hadoop.apache.org\/docs\/stable\/hadoop-yarn\/hadoop-yarn-site\/TimelineServiceV2.html.  Team Apache Hadoop. Apache Hadoop 2.9.2; The YARN Timeline Service v.2 2019. http:\/\/hadoop.apache.org\/docs\/stable\/hadoop-yarn\/hadoop-yarn-site\/TimelineServiceV2.html."},{"key":"e_1_2_1_34_1","unstructured":"Team Apache Spark. ExperimentalMethods (Spark 2.4.0 JavaDoc) 2019. https:\/\/spark.apache.org\/docs\/2.4.0\/api\/java\/org\/apache\/spark\/sql\/ExperimentalMethods.html#extraOptimizations().  Team Apache Spark. ExperimentalMethods (Spark 2.4.0 JavaDoc) 2019. https:\/\/spark.apache.org\/docs\/2.4.0\/api\/java\/org\/apache\/spark\/sql\/ExperimentalMethods.html#extraOptimizations()."},{"key":"e_1_2_1_35_1","volume-title":"spark\/filters.scala at v2.4.0 apache\/spark","author":"Spark Team Apache","year":"2019","unstructured":"Team Apache Spark . spark\/filters.scala at v2.4.0 apache\/spark , 2019 . https:\/\/github.com\/apache\/spark\/blob\/v2.4.0\/sql\/core\/src\/main\/scala\/org\/apache\/spark\/sql\/sources\/filters.scala. Team Apache Spark. spark\/filters.scala at v2.4.0 apache\/spark, 2019. https:\/\/github.com\/apache\/spark\/blob\/v2.4.0\/sql\/core\/src\/main\/scala\/org\/apache\/spark\/sql\/sources\/filters.scala."},{"key":"e_1_2_1_36_1","unstructured":"Team Apache Spark. Tuning - Spark 2.4.0 Documentation 2019. https:\/\/spark.apache.org\/docs\/latest\/tuning.html#data-locality.  Team Apache Spark. Tuning - Spark 2.4.0 Documentation 2019. https:\/\/spark.apache.org\/docs\/latest\/tuning.html#data-locality."},{"key":"e_1_2_1_37_1","volume-title":"Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale","author":"Databricks Team","year":"2019","unstructured":"Team Databricks . 
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale , 2019 . https:\/\/databricks.com\/session\/spark-sql-adaptive-execution-unleashes-the-power-of-cluster-in-large-scale. Team Databricks. Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale, 2019. https:\/\/databricks.com\/session\/spark-sql-adaptive-execution-unleashes-the-power-of-cluster-in-large-scale."},{"key":"e_1_2_1_38_1","unstructured":"TeamMapDB. MapDB 2019. http:\/\/www.mapdb.org\/.  TeamMapDB. MapDB 2019. http:\/\/www.mapdb.org\/."},{"key":"e_1_2_1_39_1","unstructured":"Team Parquet. Apache Parquet 2019. https:\/\/parquet.apache.org\/.  Team Parquet. Apache Parquet 2019. https:\/\/parquet.apache.org\/."},{"key":"e_1_2_1_40_1","volume-title":"Presto --- Distributed SQL Query Engine for Big Data","author":"Team","year":"2019","unstructured":"Team PrestoDB. Presto --- Distributed SQL Query Engine for Big Data , 2019 . http:\/\/prestodb.github.io\/. Team PrestoDB. Presto --- Distributed SQL Query Engine for Big Data, 2019. http:\/\/prestodb.github.io\/."},{"key":"e_1_2_1_41_1","unstructured":"TPC. TPC-H - Homepage 2019. http:\/\/www.tpc.org\/tpch\/.  TPC. TPC-H - Homepage 2019. http:\/\/www.tpc.org\/tpch\/."},{"key":"e_1_2_1_42_1","first-page":"31","volume-title":"IEEE Data Eng. Bull.","author":"Wanderman-Milne S.","year":"2014","unstructured":"S. Wanderman-Milne and N. Li . Runtime Code Generation in Cloudera Impala . IEEE Data Eng. Bull. , pages 31 -- 37 , 2014 . S. Wanderman-Milne and N. Li. Runtime Code Generation in Cloudera Impala. IEEE Data Eng. Bull., pages 31--37, 2014."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2384716.2384723"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3062341.3062381"},{"key":"e_1_2_1_45_1","doi-asserted-by":"crossref","first-page":"187","DOI":"10.1145\/2509578.2509581","volume-title":"Onward!","author":"W\u00fcrthinger T.","year":"2013","unstructured":"T. W\u00fcrthinger , C. 
Wimmer , A. W\u00f6\u00df , L. Stadler , G. Duboscq , C. Humer , G. Richards , D. Simon , and M. Wolczko . One VM to Rule Them All . In Onward! , pages 187 -- 204 , 2013 . T. W\u00fcrthinger, C. Wimmer, A. W\u00f6\u00df, L. Stadler, G. Duboscq, C. Humer, G. Richards, D. Simon, and M. Wolczko. One VM to Rule Them All. In Onward!, pages 187--204, 2013."},{"key":"e_1_2_1_46_1","first-page":"2","volume-title":"NSDI","author":"Zaharia M.","year":"2012","unstructured":"M. Zaharia , M. Chowdhury , T. Das , A. Dave , J. Ma , M. McCauley , M. J. Franklin , S. Shenker , and I. Stoica . Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing . In NSDI , pages 2 -- 2 , 2012 . M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In NSDI, pages 2--2, 2012."},{"key":"e_1_2_1_47_1","first-page":"10","volume-title":"HotCloud","author":"Zaharia M.","year":"2010","unstructured":"M. Zaharia , M. Chowdhury , M. J. Franklin , S. Shenker , and I. Stoica . Spark: Cluster Computing with Working Sets . In HotCloud , pages 10 -- 10 , 2010 . M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. 
In HotCloud, pages 10--10, 2010."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3377369.3377382","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T09:34:58Z","timestamp":1672220098000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3377369.3377382"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,1]]},"references-count":47,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2020,1]]}},"alternative-id":["10.14778\/3377369.3377382"],"URL":"https:\/\/doi.org\/10.14778\/3377369.3377382","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2020,1]]}}}