{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,29]],"date-time":"2025-11-29T07:56:18Z","timestamp":1764402978652,"version":"3.37.3"},"reference-count":32,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2020,3,10]],"date-time":"2020-03-10T00:00:00Z","timestamp":1583798400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,3,10]],"date-time":"2020-03-10T00:00:00Z","timestamp":1583798400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100002946","name":"Deutsches Zentrum f\u00fcr Luft- und Raumfahrt","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100002946","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Distrib Parallel Databases"],"published-print":{"date-parts":[[2020,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Apache Spark jobs are often characterized by processing huge data sets and, therefore, require runtimes in the range of minutes to hours. Thus, being able to predict the runtime of such jobs would be useful not only to know when the job will finish, but also for scheduling purposes, to estimate monetary costs for cloud deployment, or to determine an appropriate cluster configuration, such as the number of nodes. However, predicting Spark job runtimes is much more challenging than for standard database queries: cluster configuration and parameters have a significant performance impact and jobs usually contain a lot of user-defined code making it difficult to estimate cardinalities and execution costs. In this paper, we present a gray-box modeling methodology for runtime prediction of Apache Spark jobs. Our approach comprises two steps: first, a white-box model for predicting the cardinalities of the input RDDs of each operator is built based on prior knowledge about the behavior and application parameters such as applied filters data, number of iterations, etc. In the second step, a black-box model for each task constructed by monitoring runtime metrics while varying allocated resources and input RDD cardinalities is used. We further show how to use this gray-box approach not only for predicting the runtime of a given job, but also as part of a decision model for reusing intermediate cached results of Spark jobs. Our methodology is validated with experimental evaluation showing a highly accurate prediction of the actual job runtime and a performance improvement if intermediate results can be reused.<\/jats:p>","DOI":"10.1007\/s10619-020-07286-y","type":"journal-article","created":{"date-parts":[[2020,3,10]],"date-time":"2020-03-10T11:02:47Z","timestamp":1583838167000},"page":"819-839","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["A gray-box modeling methodology for runtime prediction of Apache Spark jobs"],"prefix":"10.1007","volume":"38","author":[{"given":"Hani","family":"Al-Sayeh","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0172-8162","authenticated-orcid":false,"given":"Stefan","family":"Hagedorn","sequence":"additional","affiliation":[]},{"given":"Kai-Uwe","family":"Sattler","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2020,3,10]]},"reference":[{"key":"7286_CR1","unstructured":"Apache spark: Monitoring and instrumentation. https:\/\/spark.apache.org\/docs\/latest\/monitoring.html (2019). Accessed 22 Feb 2019"},{"key":"7286_CR2","unstructured":"Apache spark official website. https:\/\/spark.apache.org\/docs\/latest\/configuration.html (2019). Accessed 22 Feb 2019"},{"key":"7286_CR3","doi-asserted-by":"crossref","unstructured":"Abiteboul, S., Duschka, O.M.: Complexity of answering queries using materialized views. In: Proceedings of the PODS, pp. 254\u2013263, (1998)","DOI":"10.1145\/275487.275516"},{"key":"7286_CR4","doi-asserted-by":"crossref","unstructured":"Camacho-Rodr\u00edguez, J. et al.: PigReuse: a reuse-based optimizer for Pig Latin. Technical Report, Inria Saclay (2016)","DOI":"10.1145\/2983323.2983669"},{"key":"7286_CR5","doi-asserted-by":"crossref","unstructured":"Chao-Qiang, H. et\u00a0al.: RDDShare: reusing results of spark RDD. In: Proceedings of the DSC, pp. 370\u2013375, (2016)","DOI":"10.1109\/DSC.2016.80"},{"key":"7286_CR6","doi-asserted-by":"crossref","unstructured":"Chaudhuri, S., Narasayya, V., Ramamurthy, R.: Estimating progress of execution for sql queries. In: Proceedings of the SIGMOD, pp. 803\u2013814, (2004)","DOI":"10.1145\/1007568.1007659"},{"key":"7286_CR7","unstructured":"Chirkova, R., Halevy, A.Y., Suciu, D.: A formal perspective on the view selection problem. In: Proceedings of the VLDB, pp. 59\u201368, (2001)"},{"key":"7286_CR8","first-page":"586","volume":"5","author":"I Elghandour","year":"2012","unstructured":"Elghandour, I., Aboulnaga, A.: Restore: reusing results of mapreduce jobs. VLDB 5, 586\u2013597 (2012)","journal-title":"VLDB"},{"key":"7286_CR9","unstructured":"Hagedorn, S., Sattler, K.: Piglet: interactive and platform transparent analytics for rdf & dynamic data. In: Proceedings of the 25th international conference companion on world wide web, WWW 2016 Companion, pp. 187\u2013190, (2016)"},{"key":"7286_CR10","doi-asserted-by":"crossref","unstructured":"Hagedorn, S., Sattler, K.U.: Cost-based sharing and recycling of (intermediate) results in dataflow programs. In: Proceedings of the ADBIS, pp. 185\u2013199. Springer, (2018)","DOI":"10.1007\/978-3-319-98398-1_13"},{"issue":"4","key":"7286_CR11","doi-asserted-by":"publisher","first-page":"270","DOI":"10.1007\/s007780100054","volume":"10","author":"AY Halevy","year":"2001","unstructured":"Halevy, A.Y.: Answering queries using views: a survey. VLDB J. 10(4), 270\u2013294 (2001)","journal-title":"VLDB J."},{"issue":"2","key":"7286_CR12","doi-asserted-by":"publisher","first-page":"205","DOI":"10.1145\/235968.233333","volume":"25","author":"V Harinarayan","year":"1996","unstructured":"Harinarayan, V., Rajaraman, A., Ullman, J.D.: Implementing data cubes efficiently. SIGMOD Rec. 25(2), 205\u2013216 (1996)","journal-title":"SIGMOD Rec."},{"key":"7286_CR13","first-page":"261","volume":"11","author":"H Herodotou","year":"2011","unstructured":"Herodotou, H., Lim, H., et al.: Starfish: a self-tuning system for big data analytics. Cidr 11, 261\u2013272 (2011)","journal-title":"Cidr"},{"key":"7286_CR14","volume-title":"Computing Queries from Derived Relations: Theoretical Foundation","author":"PA Larson","year":"1987","unstructured":"Larson, P.A., Yang, H.Z.: Computing Queries from Derived Relations: Theoretical Foundation. Department of Computer Science, University of Waterloo, Waterloo (1987)"},{"key":"7286_CR15","doi-asserted-by":"crossref","unstructured":"Marco, V.S., Taylor, B. et\u00a0al.: Improving spark application throughput via memory aware task co-location: a mixture of experts approach. In: Proceedings of the Middleware, pp. 95\u2013108. ACM, (2017)","DOI":"10.1145\/3135974.3135984"},{"key":"7286_CR16","doi-asserted-by":"crossref","unstructured":"Morton, K., Balazinska, M., Grossman, D.: Paratimer: a progress indicator for mapreduce dags. In: Proceedings of the SIGMOD, pp. 507\u2013518. ACM, (2010)","DOI":"10.1145\/1807167.1807223"},{"key":"7286_CR17","unstructured":"Mysql english dictionary. https:\/\/sourceforge.net\/projects\/mysqlenglishdictionary\/ (2019). Accessed 22 Feb 2019"},{"issue":"1\u20132","key":"7286_CR18","first-page":"494","volume":"3","author":"T Nykiel","year":"2010","unstructured":"Nykiel, T., et al.: MRShare: sharing across multiple queries in MapReduce. PVLDB 3(1\u20132), 494\u2013505 (2010)","journal-title":"PVLDB"},{"key":"7286_CR19","doi-asserted-by":"crossref","unstructured":"Perez, L.L., Jermaine, C.M.: History-aware query optimization with materialized intermediate views. In: Proceedings of the ICDE, pp. 520\u2013531. IEEE, (2014)","DOI":"10.1109\/ICDE.2014.6816678"},{"issue":"14","key":"7286_CR20","first-page":"1678","volume":"6","author":"AD Popescu","year":"2013","unstructured":"Popescu, A.D., Balmin, A., et al.: Predict: towards predicting the runtime of large scale iterative analytics. PVLDB 6(14), 1678\u20131689 (2013)","journal-title":"PVLDB"},{"key":"7286_CR21","doi-asserted-by":"crossref","unstructured":"Selinger, P., Astrahan, M.M. et\u00a0al. Access path selection in a relational database management system. In: Proceedings of the SIGMOD, pp. 23\u201334. ACM, (1979)","DOI":"10.1145\/582095.582099"},{"key":"7286_CR22","doi-asserted-by":"crossref","unstructured":"Sparks, E.R. et\u00a0al. KeystoneML: optimizing pipelines for large-scale advanced analytics. In: Proceedings of the ICDE, pp. 535\u2013546, (2017)","DOI":"10.1109\/ICDE.2017.109"},{"key":"7286_CR23","first-page":"318","volume":"96","author":"D Srivastava","year":"1996","unstructured":"Srivastava, D., Dar, S., Jagadish, H.V., Levy, A.Y.: Answering queries with aggregation using views. VLDB 96, 318\u2013329 (1996)","journal-title":"VLDB"},{"key":"7286_CR24","unstructured":"Venkataraman, S., Yang, Z. et\u00a0al. Ernest: efficient performance prediction for large-scale advanced analytics. In: Proceedings of the NDIS, pp. 363\u2013378, (2016)"},{"key":"7286_CR25","doi-asserted-by":"crossref","unstructured":"Wang, G., Chan, C.Y.: Multi-query optimization in MapReduce framework. In: Proceedings of the PVLDB, pp. 145\u2013156, (2013)","DOI":"10.14778\/2732232.2732234"},{"key":"7286_CR26","unstructured":"Wang, K., Khan, M.M.H.: Performance prediction for apache spark platform. In: Proceedings of the HPCC, pp. 166\u2013173, (2015)"},{"key":"7286_CR27","doi-asserted-by":"crossref","unstructured":"Wang, K., Khan, M.M.H., Nguyen, N., Gokhale, S.: Modeling interference for apache spark jobs. In: Proceedings of the CLOUD, pp. 423\u2013431. IEEE, (2016)","DOI":"10.1109\/CLOUD.2016.0063"},{"key":"7286_CR28","unstructured":"Xin, R., Deyhim, P., Ghodsi, A., Meng, X., Zaharia, M.: Graysort on apache spark by databricks. In: Proceedings of the GraySort Competition, (2014)"},{"key":"7286_CR29","first-page":"245","volume":"87","author":"HZ Yang","year":"1987","unstructured":"Yang, H.Z., Larson, P.A.: Query transformation for PSJ-queries. PVLDB 87, 245\u2013254 (1987)","journal-title":"PVLDB"},{"key":"7286_CR30","doi-asserted-by":"crossref","unstructured":"Zhang, Y. et\u00a0al. SRBench: a streaming RDF \/ SPARQL Benchmark. In: Proceedings of the ISWC, pp. 641\u2013657, (2012)","DOI":"10.1007\/978-3-642-35176-1_40"},{"key":"7286_CR31","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Cherkasova, L., Loo, B.T.: Performance modeling of MapReduce jobs in heterogeneous cloud environments. In: Proceedings of the CLOUD, pp. 839\u2013846, (2013)","DOI":"10.1109\/CLOUD.2013.107"},{"key":"7286_CR32","unstructured":"Zhou, P., Ruan, Z. et\u00a0al.: Doppio: I\/o-aware performance analysis, modeling and optimization for in-memory computing framework. In: Proceedings of the ISPASS, pp. 22\u201332. IEEE, (2018)"}],"container-title":["Distributed and Parallel Databases"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s10619-020-07286-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1007\/s10619-020-07286-y\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s10619-020-07286-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,3,10]],"date-time":"2021-03-10T00:40:05Z","timestamp":1615336805000},"score":1,"resource":{"primary":{"URL":"http:\/\/link.springer.com\/10.1007\/s10619-020-07286-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,3,10]]},"references-count":32,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2020,12]]}},"alternative-id":["7286"],"URL":"https:\/\/doi.org\/10.1007\/s10619-020-07286-y","relation":{},"ISSN":["0926-8782","1573-7578"],"issn-type":[{"type":"print","value":"0926-8782"},{"type":"electronic","value":"1573-7578"}],"subject":[],"published":{"date-parts":[[2020,3,10]]},"assertion":[{"value":"10 March 2020","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}