{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T03:42:33Z","timestamp":1760240553104,"version":"build-2065373602"},"reference-count":44,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2019,8,9]],"date-time":"2019-08-09T00:00:00Z","timestamp":1565308800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>Evaluating and predicting the performance of big data applications are required to efficiently size capacities and manage operations. Gaining profound insights into the system architecture, dependencies of components, resource demands, and configurations is difficult for engineers. To address these challenges, this paper presents an approach to automatically extract and transform system specifications to predict the performance of applications. It consists of three components. First, a system- and tool-agnostic domain-specific language (DSL) allows the modeling of performance-relevant factors of big data applications, computing resources, and data workload. Second, DSL instances are automatically extracted from monitored measurements of Apache Spark and Apache Hadoop (i.e., YARN and HDFS) systems. Third, these instances are transformed to model- and simulation-based performance evaluation tools to allow predictions. By adapting DSL instances, our approach enables engineers to predict the performance of applications for different scenarios such as changing data input and resources. We evaluate our approach by predicting the performance of linear regression and random forest applications of the HiBench benchmark suite.
Simulation results of adjusted DSL instances, compared to measurement results, show accurate predictions with errors below 15% based upon averages for response times and resource utilization.<\/jats:p>","DOI":"10.3390\/bdcc3030047","type":"journal-article","created":{"date-parts":[[2019,8,9]],"date-time":"2019-08-09T11:11:31Z","timestamp":1565349091000},"page":"47","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["PerTract: Model Extraction and Specification of Big Data Systems for Performance Prediction by the Example of Apache Spark and Hadoop"],"prefix":"10.3390","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4382-0995","authenticated-orcid":false,"given":"Johannes","family":"Kro\u00df","sequence":"first","affiliation":[{"name":"fortiss, Research Institute of the Free State of Bavaria, Guerickestr. 25, 80805 Munich, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2754-8493","authenticated-orcid":false,"given":"Helmut","family":"Krcmar","sequence":"additional","affiliation":[{"name":"Chair for Information Systems, Technical University of Munich (TUM), Boltzmannstr. 3, 85748 Garching, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2019,8,9]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"261","DOI":"10.1007\/s12599-014-0345-1","article-title":"Big Data\u2014An interdisciplinary opportunity for information systems research","volume":"6","author":"Schermann","year":"2014","journal-title":"Bus. Inf. Syst. Eng."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"177","DOI":"10.1007\/s12599-014-0323-7","article-title":"Performance management work","volume":"6","author":"Brunnert","year":"2014","journal-title":"Bus. Inf. Syst. Eng."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Wang, K., and Khan, M.M.H.
(2015, January 24\u201326). Performance Prediction for Apache Spark Platform. Proceedings of the 17th International Conference on High Performance Computing and Communications, New York, NY, USA.","DOI":"10.1109\/HPCC-CSS-ICESS.2015.246"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"157","DOI":"10.1109\/TSE.2014.2362755","article-title":"Quantitative Evaluation of Model-Driven Performance Analysis and Simulation of Component-Based Architectures","volume":"41","author":"Brosig","year":"2015","journal-title":"IEEE Trans. Softw. Eng."},{"key":"ref_5","unstructured":"Brunnert, A., van Hoorn, A., Willnecker, F., Danciu, A., Hasselbring, W., Heger, C., Herbst, N., Jamshidi, P., Jung, R., and von Kistowski, J. (2015). Performance-Oriented DevOps: A Research Agenda, SPEC Research Group\u2014DevOps Performance Working Group, Standard Performance Evaluation Corporation (SPEC). Available online: http:\/\/research.spec.org\/fileadmin\/user_upload\/documents\/wg_devops\/endorsed_publications\/SPEC-RG-2015-001-DevOpsPerformanceResearchAgenda.pdf."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1016\/j.jss.2008.03.066","article-title":"The Palladio component model for model-driven performance prediction","volume":"82","author":"Becker","year":"2009","journal-title":"J. Syst. Softw."},{"key":"ref_7","unstructured":"Kro\u00df, J. (2019, August 07). PerTract. Available online: https:\/\/github.com\/johanneskross\/pertract."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"243","DOI":"10.1007\/978-3-319-23267-6_16","article-title":"Stream Processing on Demand for Lambda Architectures","volume":"Volume 9272","author":"Knottenbelt","year":"2015","journal-title":"Computer Performance Engineering"},{"key":"ref_9","unstructured":"Kro\u00df, J., Brunnert, A., and Krcmar, H. (2015, January 4\u20136). Modeling Big Data Systems by Extending the Palladio Component Model. 
Proceedings of the 2015 Symposium on Software Performance, Munich, Germany."},{"key":"ref_10","unstructured":"Kro\u00df, J., and Krcmar, H. (2016, January 8\u20139). Modeling and Simulating Apache Spark Streaming Applications. Proceedings of the 2016 Symposium on Software Performance, Kiel, Germany."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Kro\u00df, J., and Krcmar, H. (2017, January 20\u201322). Model-based Performance Evaluation of Batch and Stream Applications for Big Data. Proceedings of the IEEE 25th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Banff, AB, Canada.","DOI":"10.1109\/MASCOTS.2017.21"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"495","DOI":"10.1007\/s10766-012-0227-4","article-title":"Analytical Performance Models for MapReduce Workloads","volume":"41","author":"Vianna","year":"2013","journal-title":"Int. J. Parallel Program."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"328","DOI":"10.1016\/j.peva.2014.07.020","article-title":"Profiling and evaluating hardware choices for MapReduce environments: An application-aware approach","volume":"79","author":"Verma","year":"2014","journal-title":"Perform. Eval."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Cherkasova, L., and Loo, B.T. (2013, January 21\u201324). Benchmarking Approach for Designing a Mapreduce Performance Model. Proceedings of the ACM\/SPEC International Conference on Performance Engineering, Prague, Czech Republic.","DOI":"10.1145\/2479871.2479906"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Cherkasova, L., and Loo, B.T. (July, January 28). Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments. 
Proceedings of the 2013 IEEE Sixth International Conference on Cloud Computing, Santa Clara, CA, USA.","DOI":"10.1109\/CLOUD.2013.107"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"38","DOI":"10.1145\/2788402.2788409","article-title":"Exploiting Cloud Heterogeneity to Optimize Performance and Cost of MapReduce Processing","volume":"42","author":"Zhang","year":"2015","journal-title":"SIGMETRICS Perform. Eval. Rev."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"345","DOI":"10.1016\/j.future.2013.12.036","article-title":"Performance evaluation of NoSQL big-data applications using multi-formalism models","volume":"37","author":"Barbierato","year":"2014","journal-title":"Future Gener. Comput. Syst."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Carretero, J., Garcia-Blas, J., Ko, R.K., Mueller, P., and Nakano, K. (2016). Modeling Performance of Hadoop Applications: A Journey from Queueing Networks to Stochastic Well Formed Nets. Algorithms and Architectures for Parallel Processing, Springer International Publishing. Lecture Notes in Computer Science.","DOI":"10.1007\/978-3-319-49583-5"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Lehrig, S. (2014, January 22). Applying Architectural Templates for Design-Time Scalability and Elasticity Analyses of SaaS Applications. Proceedings of the 2nd International Workshop on Hot Topics in Cloud Service Scalability, Dublin, Ireland.","DOI":"10.1145\/2649563.2649573"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Ardagna, D., Barbierato, E., Evangelinou, A., Gianniti, E., Gribaudo, M., Pinto, T.B.M., Guimar\u00e3es, A., da Silva, A.P.C., and Almeida, J.M. (2018, January 9\u201313). Performance Prediction of Cloud-Based Big Data Applications. Proceedings of the ACM\/SPEC International Conference on Performance Engineering, Berlin, Germany.","DOI":"10.1145\/3184407.3184420"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Nambiar, R., and Poess, M. 
(2018). Performance Assurance Model for Applications on SPARK Platform. Performance Evaluation and Benchmarking for the Analytics Era, Springer International Publishing. Lecture Notes in Computer Science.","DOI":"10.1007\/978-3-319-72401-0"},{"key":"ref_22","unstructured":"Venkataraman, S., Yang, Z., Franklin, M., Recht, B., and Stoica, I. (2016, January 13\u201317). Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics. Proceedings of the 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), Santa Clara, CA, USA."},{"key":"ref_23","unstructured":"Alipourfard, O., Liu, H.H., Chen, J., Venkataraman, S., Yum, M., and Zhang, M. (2017, January 27\u201329). CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics. Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), Boston, MA, USA."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1016\/j.is.2019.01.006","article-title":"Predictive performance modeling for distributed batch processing using black box monitoring and machine learning","volume":"82","author":"Witt","year":"2019","journal-title":"Inf. Syst."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1127","DOI":"10.1002\/spe.2269","article-title":"Modeling performances of concurrent big data applications","volume":"45","author":"Castiglione","year":"2014","journal-title":"Softw. Pract. Exp."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Niemann, R. (2016, January 12\u201316). Towards the Prediction of the Performance and Energy Efficiency of Distributed Data Management Systems. 
Proceedings of the ACM\/SPEC International Conference on Performance Engineering, Delft, The Netherlands.","DOI":"10.1145\/2859889.2859891"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Casale, G., Ardagna, D., Artac, M., Barbier, F., Nitto, E.D., Henry, A., Iuhasz, G., Joubert, C., Merseguer, J., and Munteanu, V.I. (2015, January 16\u201324). DICE: Quality-driven Development of Data-intensive Cloud Applications. Proceedings of the Seventh International Workshop on Modeling in Software Engineering, Florence, Italy.","DOI":"10.1109\/MiSE.2015.21"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Guerriero, M., Tajfar, S., Tamburri, D.A., and Di Nitto, E. (,  2016). Towards a Model-driven Design Tool for Big Data Architectures. Proceedings of the 2nd International Workshop on BIG Data Software Engineering, Austin, TX, USA.","DOI":"10.1145\/2896825.2896835"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"G\u00f3mez, A., Merseguer, J., Di Nitto, E., and Tamburri, D.A. (2016, January 21). Towards a UML Profile for Data Intensive Applications. Proceedings of the 2nd International Workshop on Quality-Aware DevOps, Saarbr\u00fccken, Germany.","DOI":"10.1145\/2945408.2945412"},{"key":"ref_30","unstructured":"Ginis, R., and Strom, R.E. (2010). Method for Predicting Performance of Distributed Stream Processing Systems. (7,818,417), U.S. Patent."},{"key":"ref_31","unstructured":"Steinberg, D., Budinsky, F., Paternostro, M., and Merks, E. (2009). EMF: Eclipse Modeling Framework, Addison-Wesley. [2nd ed.]."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"King, B. (2004).
Performance Assurance for IT Systems, Auerbach Publications.","DOI":"10.1201\/9780203334577"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"83","DOI":"10.1007\/s11576-007-0030-9","article-title":"Cost accounting for shared IT infrastructures","volume":"49","author":"Brandl","year":"2007","journal-title":"Wirtschaftsinformatik"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"239","DOI":"10.1016\/j.jss.2015.08.030","article-title":"Continuous Performance Evaluation and Capacity Planning Using Resource Profiles for Enterprise Applications","volume":"123","author":"Brunnert","year":"2017","journal-title":"J. Syst. Softw."},{"key":"ref_35","unstructured":"Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., and Stoica, I. (2012, January 25\u201327). Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, San Jose, CA, USA."},{"key":"ref_36","unstructured":"Apache Spark (2018, February 19). Lightning-Fast Cluster Computing. Available online: https:\/\/spark.apache.org."},{"key":"ref_37","unstructured":"Ousterhout, K., Rasti, R., Ratnasamy, S., Shenker, S., and Chun, B.G. (2015, January 4\u20136). Making Sense of Performance in Data Analytics Frameworks. Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation, Oakland, CA, USA."},{"key":"ref_38","unstructured":"Apache Hadoop (2017, January 01). Welcome to Apache Hadoop!. Available online: https:\/\/hadoop.apache.org\/."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1145\/1327452.1327492","article-title":"MapReduce: Simplified Data Processing on Large Clusters","volume":"51","author":"Dean","year":"2008","journal-title":"Commun. 
ACM"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"75","DOI":"10.2307\/25148625","article-title":"Design Science in Information Systems Research","volume":"28","author":"Hevner","year":"2004","journal-title":"MIS Q."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Huang, S., Huang, J., Dai, J., Xie, T., and Huang, B. (2010, January 1\u20136). The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. Proceedings of the 26th International Conference on Data Engineering Workshops, Long Beach, CA, USA.","DOI":"10.1109\/ICDEW.2010.5452747"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Wohlin, C., Runeson, P., H\u00f6st, M., Ohlsson, M.C., Regnell, B., and Wessl\u00e9n, A. (2012). Experimentation in Software Engineering, Springer.","DOI":"10.1007\/978-3-642-29044-2"},{"key":"ref_43","unstructured":"Heinrich, R., Eichelberger, H., and Schmid, K. (2016, January 2). Performance Modeling in the Age of Big Data\u2014Some Reflections on Current Limitations. Proceedings of the 3rd International Workshop on Interplay of Model-Driven and Component-Based Software Engineering, Saint-Malo, France."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"23:1","DOI":"10.1145\/3019596","article-title":"Modeling and Extracting Load Intensity Profiles","volume":"11","author":"Kistowski","year":"2017","journal-title":"ACM Trans. Auton. Adapt. 
Syst."}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/3\/3\/47\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T13:09:53Z","timestamp":1760188193000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/3\/3\/47"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,8,9]]},"references-count":44,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2019,9]]}},"alternative-id":["bdcc3030047"],"URL":"https:\/\/doi.org\/10.3390\/bdcc3030047","relation":{},"ISSN":["2504-2289"],"issn-type":[{"type":"electronic","value":"2504-2289"}],"subject":[],"published":{"date-parts":[[2019,8,9]]}}}