{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T17:38:34Z","timestamp":1740159514432,"version":"3.37.3"},"reference-count":36,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2022,5,31]],"date-time":"2022-05-31T00:00:00Z","timestamp":1653955200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,5,31]],"date-time":"2022-05-31T00:00:00Z","timestamp":1653955200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100006764","name":"Technische Universit\u00e4t Berlin","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100006764","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Datenbank Spektrum"],"published-print":{"date-parts":[[2022,7]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires significant insights into expected job runtimes and scaling behavior, resource characteristics, input data distributions, and other factors. Unable to estimate performance accurately, users frequently overprovision resources for their jobs, leading to low resource utilization and high costs.<\/jats:p><jats:p>In this paper, we present major building blocks towards a\u00a0collaborative approach for optimization of data processing cluster configurations based on runtime data and performance models. We believe that runtime data can be shared and used for performance models across different execution contexts, significantly reducing the reliance on the recurrence of individual processing jobs or, else, dedicated job profiling. For this, we describe how the similarity of processing jobs and cluster infrastructures can be employed to combine suitable data points from local and global job executions into accurate performance models. Furthermore, we outline approaches to performance prediction via more context-aware and reusable models. Finally, we lay out how metrics from previous executions can be combined with runtime monitoring to effectively re-configure models and clusters dynamically.<\/jats:p>","DOI":"10.1007\/s13222-022-00416-z","type":"journal-article","created":{"date-parts":[[2022,5,31]],"date-time":"2022-05-31T12:02:41Z","timestamp":1653998561000},"page":"143-151","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A\u00a0Research Overview"],"prefix":"10.1007","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3755-1503","authenticated-orcid":false,"given":"Lauritz","family":"Thamsen","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dominik","family":"Scheinert","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jonathan","family":"Will","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jonathan","family":"Bader","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Odej","family":"Kao","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2022,5,31]]},"reference":[{"key":"416_CR1","volume-title":"MDM","author":"K Aberer","year":"2007","unstructured":"Aberer K, Hauswirth M, Salehi A (2007) Infrastructure for data processing in large-scale interconnected sensor networks. In: MDM. IEEE,"},{"key":"416_CR2","doi-asserted-by":"publisher","DOI":"10.1007\/s10619-020-07286-y","volume-title":"A\u00a0Gray-box modeling methodology for runtime prediction of Apache spark jobs","author":"H Al-Sayeh","year":"2020","unstructured":"Al-Sayeh H, Hagedorn S, Sattler K-U (2020) A\u00a0Gray-box modeling methodology for runtime prediction of Apache spark jobs. DPD"},{"key":"416_CR3","volume-title":"NSDI","author":"O Alipourfard","year":"2017","unstructured":"Alipourfard O, Liu HH, Chen J, Venkataraman S, Yu M, Zhang M (2017) Cherrypick: adaptively unearthing the best cloud configurations for big data analytics. In: NSDI. USENIX,"},{"key":"416_CR4","volume-title":"BigData","author":"J Bader","year":"2021","unstructured":"Bader J, Thamsen L, Kulagina S, Will J, Meyerhenke H, Kao O (2021) Tarema: adaptive resource allocation for scalable scientific workflows in heterogeneous clusters. In: BigData. IEEE,"},{"key":"416_CR5","volume-title":"SoCC","author":"M Bilal","year":"2020","unstructured":"Bilal M, Canini M, Rodrigues R (2020) Finding the right cloud configuration for analytics clusters. In: SoCC. ACM,"},{"key":"416_CR6","series-title":"DE Bulletin","volume-title":"Apache Flink: stream and batch processing in a\u00a0single engine","author":"P Carbone","year":"2015","unstructured":"Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K (2015) Apache Flink: stream and batch processing in a\u00a0single engine. DE Bulletin"},{"key":"416_CR7","volume-title":"IEEE ISTA","author":"PK Chan","year":"1999","unstructured":"Chan PK, Fan W, Prodromidis AL, Stolfo SJ (1999) Distributed data mining in credit card fraud detection. In: IEEE ISTA"},{"key":"416_CR8","series-title":"FGCS","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2018.06.032","volume-title":"A\u00a0gray-box performance model for Apache Spark","author":"Z Chao","year":"2018","unstructured":"Chao Z, Shi S, Gao H, Luo J, Wang H (2018) A\u00a0gray-box performance model for Apache Spark. FGCS"},{"key":"416_CR9","volume-title":"APSys","author":"Y Cheng","year":"2018","unstructured":"Cheng Y, Chai Z, Anwar A (2018) Characterizing co-located datacenter workloads: an Alibaba case study. In: APSys. ACM,"},{"key":"416_CR10","volume-title":"WWW","author":"A Das","year":"2007","unstructured":"Das A, Datar M, Garg A, Rajaram S (2007) Google News personalization: scalable online collaborative filtering. In: WWW. ACM,"},{"key":"416_CR11","series-title":"CACM","doi-asserted-by":"publisher","DOI":"10.1145\/1327452.1327492","volume-title":"Mapreduce: simplified data processing on large clusters","author":"J Dean","year":"2008","unstructured":"Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. CACM"},{"key":"416_CR12","series-title":"ACM SIGPLAN notices","volume-title":"Quasar: resource-efficient and QoS-aware cluster management","author":"C Delimitrou","year":"2014","unstructured":"Delimitrou C, Kozyrakis C (2014) Quasar: resource-efficient and QoS-aware cluster management. ACM SIGPLAN notices"},{"key":"416_CR13","volume-title":"SIGMOD","author":"B Ding","year":"2019","unstructured":"Ding B, Das S, Marcus R, Wu W, Chaudhuri S, Narasayya VR (2019) AI meets AI: leveraging query executions to improve index recommendations. In: SIGMOD. ACM,"},{"key":"416_CR14","volume-title":"ICDCS","author":"C-J Hsu","year":"2018","unstructured":"Hsu C-J, Nair V, Freeh VW, Menzies T (2018) Arrow: low-level augmented Bayesian optimization for finding the best cloud VM. In: ICDCS. IEEE,"},{"key":"416_CR15","volume-title":"CLOUD","author":"C-J Hsu","year":"2018","unstructured":"Hsu C-J, Nair V, Menzies T, Freeh V (2018) Micky: a\u00a0cheaper alternative for selecting cloud instances. In: CLOUD. IEEE,"},{"key":"416_CR16","volume-title":"PDCAT","author":"J Koch","year":"2017","unstructured":"Koch J, Thamsen L, Schmidt F, Kao O (2017) SMiPE: estimating the progress of recurring iterative distributed dataflows. In: PDCAT. IEEE,"},{"key":"416_CR17","doi-asserted-by":"publisher","DOI":"10.14778\/3461535.3461549","volume-title":"Towards cost-optimal query processing in the cloud","author":"V Leis","year":"2021","unstructured":"Leis V, Kuschewski M (2021) Towards cost-optimal query processing in the cloud. VLDB"},{"key":"416_CR18","volume-title":"DASC","author":"H Liu","year":"2011","unstructured":"Liu H (2011) A\u00a0measurement study of server utilization in public clouds. In: DASC. IEEE,"},{"key":"416_CR19","doi-asserted-by":"crossref","unstructured":"K.\u00a0Rajan, D.\u00a0Kakadia, C.\u00a0Curino, and S.\u00a0Krishnan. PerfOrator: Eloquent Performance Models for Resource Optimization. In SoCC, 2016.","DOI":"10.1145\/2987550.2987566"},{"key":"416_CR20","volume-title":"BigData","author":"D Scheinert","year":"2021","unstructured":"Scheinert D, Alamgiralem A, Bader J, Will J, Wittkopp T, Thamsen L (2021) On the potential of execution traces for batch processing workload optimization in public clouds. In: BigData. IEEE,"},{"key":"416_CR21","volume-title":"CLUSTER","author":"D Scheinert","year":"2021","unstructured":"Scheinert D, Thamsen L, Zhu H, Will J, Acker A, Wittkopp T, Kao O (2021) Bellamy: reusing performance models for distributed dataflow jobs across contexts. In: CLUSTER. IEEE,"},{"key":"416_CR22","volume-title":"IPCCC","author":"D Scheinert","year":"2021","unstructured":"Scheinert D, Zhu H, Thamsen L, Geldenhuys MK, Will J, Acker A, Kao O (2021) Enel: context-aware dynamic scaling of distributed dataflow jobs using graph propagation. In: IPCCC. IEEE,"},{"key":"416_CR23","volume-title":"CNSM","author":"S Shah","year":"2019","unstructured":"Shah S, Amannejad Y, Krishnamurthy D, Wang M (2019) Quick execution time predictions for spark applications. In: CNSM. IEEE,"},{"key":"416_CR24","volume-title":"CCGrid","author":"S Sidhanta","year":"2016","unstructured":"Sidhanta S, Golab W, Mukhopadhyay S (2016) OptEx: a\u00a0deadline-aware cost optimization model for spark. In: CCGrid. IEEE,"},{"key":"416_CR25","volume-title":"IPCCC","author":"L Thamsen","year":"2016","unstructured":"Thamsen L, Verbitskiy I, Schmidt F, Renner T, Kao O (2016) Selecting resources for distributed dataflow systems according to runtime targets. In: IPCCC. IEEE,"},{"key":"416_CR26","volume-title":"CloudCom","author":"L Thamsen","year":"2017","unstructured":"Thamsen L, Verbitskiy I, Beilharz J, Renner T, Polze A, Kao O (2017) Ellis: dynamically scaling distributed dataflows to meet runtime targets. In: CloudCom. IEEE,"},{"key":"416_CR27","series-title":"CCPE","volume-title":"Mary, Hugo, and Hugo*: learning to schedule distributed data-parallel processing jobs on shared clusters","author":"L Thamsen","year":"2021","unstructured":"Thamsen L, Beilharz J, Tran VT, Nedelkoski S, Kao O (2021) Mary, Hugo, and Hugo*: learning to schedule distributed data-parallel processing jobs on shared clusters. CCPE"},{"key":"416_CR28","volume-title":"NSDI","author":"S Venkataraman","year":"2016","unstructured":"Venkataraman S, Yang Z, Franklin M, Recht B, Stoica I (2016) Ernest: efficient performance prediction for large-scale advanced analytics. In: NSDI. USENIX,"},{"key":"416_CR29","volume-title":"CloudCom","author":"I Verbitskiy","year":"2018","unstructured":"Verbitskiy I, Thamsen L, Renner T, Kao O (2018) CoBell: runtime prediction for distributed dataflow jobs in shared clusters. In: CloudCom. IEEE,"},{"key":"416_CR30","volume-title":"HPCC","author":"K Wang","year":"2015","unstructured":"Wang K, Khan MMH (2015) Performance prediction for Apache Spark platform. In: HPCC. IEEE,"},{"key":"416_CR31","volume-title":"BigData","author":"J Will","year":"2020","unstructured":"Will J, Bader J, Thamsen L (2020) Towards collaborative optimization of cluster configurations for distributed dataflow jobs. In: BigData"},{"key":"416_CR32","doi-asserted-by":"crossref","unstructured":"Will J, Arslan O, Bader J, Scheinert D, Thamsen L (2021) Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud. In BigData","DOI":"10.1109\/BigData52589.2021.9671742"},{"key":"416_CR33","volume-title":"Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds. In IC2E","author":"J Will","year":"2021","unstructured":"Will J, Thamsen L, Scheinert D, Bader J, Kao OCO (2021) Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds. In IC2E"},{"key":"416_CR34","doi-asserted-by":"publisher","first-page":"33","DOI":"10.1016\/j.is.2019.01.006","volume":"82","author":"C Witt","year":"2019","unstructured":"Witt\u00a0C, Bux\u00a0M, Gusew\u00a0W, Leser\u00a0U (2019) Predictive Performance Modeling for Distributed Batch Processing Using Black Box Monitoring and Machine Learning. IS 82:33\u201352. https:\/\/doi.org\/10.1016\/j.is.2019.01.006","journal-title":"IS"},{"key":"416_CR35","unstructured":"Zaharia\u00a0M, Chowdhury\u00a0M, Franklin\u00a0MJ, Shenker\u00a0S, Stoica\u00a0I et\u00a0al (2010) Spark: Cluster Computing with Working Sets. HotCloud"},{"key":"416_CR36","volume-title":"Doppio: I\/O-Aware Performance Analysis, Modeling and Optimization for In-memory Computing Framework. In ISPASS","author":"P Zhou","year":"2018","unstructured":"Zhou P, Ruan Z, Fang Z, Shand M, Roazen D, Cong J (2018) Doppio: I\/O-Aware Performance Analysis, Modeling and Optimization for In-memory Computing Framework. In ISPASS. IEEE"}],"container-title":["Datenbank-Spektrum"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s13222-022-00416-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s13222-022-00416-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s13222-022-00416-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,7,29]],"date-time":"2022-07-29T11:23:56Z","timestamp":1659093836000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s13222-022-00416-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,31]]},"references-count":36,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,7]]}},"alternative-id":["416"],"URL":"https:\/\/doi.org\/10.1007\/s13222-022-00416-z","relation":{},"ISSN":["1618-2162","1610-1995"],"issn-type":[{"type":"print","value":"1618-2162"},{"type":"electronic","value":"1610-1995"}],"subject":[],"published":{"date-parts":[[2022,5,31]]},"assertion":[{"value":"1 February 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 May 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"31 May 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}