{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T14:15:36Z","timestamp":1774966536023,"version":"3.50.1"},"reference-count":52,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2025,2,1]],"date-time":"2025-02-01T00:00:00Z","timestamp":1738368000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>In this study, we analyze the performance of the machine learning operators in Apache Spark MLlib for K-Means, Random Forest Regression, and Word2Vec. We used a multi-node Spark cluster along with collected detailed execution metrics computed from the data of diverse datasets and parameter settings. The data were used to train predictive models that had up to 98% accuracy in forecasting performance. By building actionable predictive models, our research provides a unique treatment for key hyperparameter tuning, scalability, and real-time resource allocation challenges. Specifically, the practical value of traditional models in optimizing Apache Spark MLlib workflows was shown, achieving up to 30% resource savings and a 25% reduction in processing time. These models enable system optimization, reduce the amount of computational overheads, and boost the overall performance of big data applications. Ultimately, this work not only closes significant gaps in predictive performance modeling, but also paves the way for real-time analytics over a distributed environment.<\/jats:p>","DOI":"10.3390\/a18020074","type":"journal-article","created":{"date-parts":[[2025,2,3]],"date-time":"2025-02-03T12:18:56Z","timestamp":1738585136000},"page":"74","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":22,"title":["Optimizing Apache Spark MLlib: Predictive Performance of Large-Scale Models for Big Data Analytics"],"prefix":"10.3390","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0891-6780","authenticated-orcid":false,"given":"Leonidas","family":"Theodorakopoulos","sequence":"first","affiliation":[{"name":"Department of Management Science and Technology, University of Patras, 26334 Patras, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4632-6511","authenticated-orcid":false,"given":"Aristeidis","family":"Karras","sequence":"additional","affiliation":[{"name":"Computer Engineering and Informatics Department, University of Patras, 26504 Patras, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0008-547X","authenticated-orcid":false,"given":"George A.","family":"Krimpas","sequence":"additional","affiliation":[{"name":"Computer Engineering and Informatics Department, University of Patras, 26504 Patras, Greece"}]}],"member":"1968","published-online":{"date-parts":[[2025,2,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Assali, T., Ayoub, Z.T., and Ouni, S. (2024, January 22\u201325). Multivariate LSTM for Execution Time Prediction in HPC for Distributed Deep Learning Training. Proceedings of the 2024 IEEE 27th International Symposium on Real-Time Distributed Computing (ISORC), Tunis, Tunisia.","DOI":"10.1109\/ISORC61049.2024.10551326"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Salman, S.M., Dao, V.L., Papadopoulos, A.V., Mubeen, S., and Nolte, T. (2023, January 23\u201325). Scheduling Firm Real-time Applications on the Edge with Single-bit Execution Time Prediction. Proceedings of the 2023 IEEE 26th International Symposium on Real-Time Distributed Computing (ISORC), Nashville, TN, USA.","DOI":"10.1109\/ISORC58943.2023.00037"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Chen, R. (2024, January 23\u201324). Research on the Performance of Collaborative Filtering Algorithms in Library Book Recommendation Systems: Optimization of the Spark ALS Model. Proceedings of the 2024 International Conference on Integrated Circuits and Communication Systems (ICICACS), Raichur, India.","DOI":"10.1109\/ICICACS60521.2024.10499133"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Han, M. (2023, January 24\u201326). Research on optimization of K-means Algorithm Based on Spark. Proceedings of the 2023 IEEE 6th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China.","DOI":"10.1109\/ITNEC56291.2023.10082476"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"256","DOI":"10.1109\/TCC.2017.2732344","article-title":"Predicting workflow task execution time in the cloud using a two-stage machine learning approach","volume":"8","author":"Pham","year":"2017","journal-title":"IEEE Trans. Cloud Comput."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"e7905","DOI":"10.1002\/cpe.7905","article-title":"Improving prediction of computational job execution times with machine learning","volume":"36","author":"Balis","year":"2024","journal-title":"Concurr. Comput. Pract. Exp."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Schizas, N., Karras, A., Karras, C., and Sioutas, S. (2022). TinyML for Ultra-Low Power AI and Large Scale IoT Deployments: A Systematic Review. Future Internet, 14.","DOI":"10.3390\/fi14120363"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Karras, A., Giannaros, A., Theodorakopoulos, L., Krimpas, G.A., Kalogeratos, G., Karras, C., and Sioutas, S. (2023). FLIBD: A federated learning-based IoT big data management approach for privacy-preserving over Apache Spark with FATE. Electronics, 12.","DOI":"10.3390\/electronics12224633"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Karras, A., Karras, C., Giotopoulos, K.C., Tsolis, D., Oikonomou, K., and Sioutas, S. (2023). Federated Edge Intelligence and Edge Caching Mechanisms. Information, 14.","DOI":"10.3390\/info14070414"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Sewal, P., and Singh, H. (2022, January 25\u201327). A Machine Learning Approach for Predicting Execution Statistics of Spark Application. Proceedings of the 2022 Seventh International Conference on Parallel, Distributed and Grid Computing (PDGC), Solan, India.","DOI":"10.1109\/PDGC56933.2022.10053356"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Ye, G., Liu, W., Wu, C.Q., Shen, W., and Lyu, X. (2020, January 6\u20138). On Machine Learning-based Stage-aware Performance Prediction of Spark Applications. Proceedings of the 2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC), Austin, TX, USA.","DOI":"10.1109\/IPCCC50635.2020.9391564"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"3123","DOI":"10.1093\/comjnl\/bxab131","article-title":"A hybrid machine learning approach for performance modeling of cloud-based big data applications","volume":"65","author":"Ataie","year":"2022","journal-title":"Comput. J."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Gulino, A., Canakoglu, A., Ceri, S., and Ardagna, D. (2020, January 17\u201319). Performance Prediction for Data-driven Workflows on Apache Spark. Proceedings of the 2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Nice, France.","DOI":"10.1109\/MASCOTS50786.2020.9285944"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"1332","DOI":"10.1109\/TPDS.2018.2800011","article-title":"Learning-Based Memory Allocation Optimization for Delay-Sensitive Big Data Processing","volume":"29","author":"Tsai","year":"2018","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"G\u00e1rate-Escamilla, A.K., El Hassani, A.H., and Andres, E. (2019, January 28\u201330). Big data execution time based on Spark Machine Learning Libraries. Proceedings of the 2019 3rd International Conference on Cloud and Big Data Computing, Oxford, UK.","DOI":"10.1145\/3358505.3358519"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Wang, G., Xu, J., and He, B. (2016, January 12\u201314). A Novel Method for Tuning Configuration Parameters of Spark Based on Machine Learning. Proceedings of the 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC\/SmartCity\/DSS), Sydney, NSW, Australia.","DOI":"10.1109\/HPCC-SmartCity-DSS.2016.0088"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Lu, X., Shankar, D., Gugnani, S., and Panda, D.K. (2016, January 5\u20138). High-performance design of apache spark with RDMA and its benefits on various workloads. Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA.","DOI":"10.1109\/BigData.2016.7840611"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Manzi, D., and Tompkins, D. (2016, January 4\u20138). Exploring GPU Acceleration of Apache Spark. Proceedings of the 2016 IEEE International Conference on Cloud Engineering (IC2E), Berlin, Germany.","DOI":"10.1109\/IC2E.2016.30"},{"key":"ref_19","first-page":"1","article-title":"MFRLMO: Model-free reinforcement learning for multi-objective optimization of apache spark","volume":"11","year":"2024","journal-title":"EAI Endorsed Trans. Scalable Inf. Syst."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Ishizaki, K. (2019, January 7\u201311). Analyzing and optimizing java code generation for apache spark query plan. Proceedings of the 2019 ACM\/SPEC International Conference on Performance Engineering, Mumbai, India.","DOI":"10.1145\/3297663.3310300"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"493","DOI":"10.3390\/jcp3030025","article-title":"Autonomous vehicles: Sophisticated attacks, safety issues, challenges, open topics, blockchain, and future directions","volume":"3","author":"Giannaros","year":"2023","journal-title":"J. Cybersecur. Priv."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Theodorakopoulos, L., Karras, A., Theodoropoulou, A., and Kampiotis, G. (2024). Benchmarking Big Data Systems: Performance and Decision-Making Implications in Emerging Technologies. Technologies, 12.","DOI":"10.3390\/technologies12110217"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Karras, A., Giannaros, A., Karras, C., Theodorakopoulos, L., Mammassis, C.S., Krimpas, G.A., and Sioutas, S. (2024). TinyML algorithms for Big Data Management in large-scale IoT systems. Future Internet, 16.","DOI":"10.3390\/fi16020042"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"386","DOI":"10.1108\/IJLM-01-2020-0043","article-title":"The impact of emerging and disruptive technologies on freight transportation in the digital era: Current state and future trends","volume":"32","author":"Dong","year":"2021","journal-title":"Int. J. Logist. Manag."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Ohlhorst, F.J. (2012). Big Data Analytics: Turning Big Data into Big Money, John Wiley & Sons.","DOI":"10.1002\/9781119205005"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"77","DOI":"10.47604\/ijscm.2547","article-title":"Integration of Emerging Technologies AI and ML into Strategic Supply Chain Planning Processes to Enhance Decision-Making and Agility","volume":"9","author":"Vummadi","year":"2024","journal-title":"Int. J. Supply Chain. Manag."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Sun, Z. (2019). Intelligent big data analytics: A managerial perspective. Managerial Perspectives on Intelligent Big Data Analytics, IGI Global.","DOI":"10.4018\/978-1-5225-7277-0"},{"key":"ref_28","first-page":"1","article-title":"Multimedia big data analytics: A survey","volume":"51","author":"Pouyanfar","year":"2018","journal-title":"ACM Comput. Surv. (CSUR)"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Sterling, M. (2017, January 19\u201322). Situated big data and big data analytics for healthcare. Proceedings of the 2017 IEEE Global Humanitarian Technology Conference (GHTC), San Jose, CA, USA.","DOI":"10.1109\/GHTC.2017.8239322"},{"key":"ref_30","first-page":"3","article-title":"Perspectives on big data and big data analytics","volume":"3","author":"Ularu","year":"2012","journal-title":"Database Syst. J."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Crowder, J.A., Carbone, J., Friess, S., Crowder, J.A., Carbone, J., and Friess, S. (2020). Data analytics: The big data analytics process (bdap) architecture. Artificial Psychology: Psychological Modeling and Testing of AI Systems, Springer.","DOI":"10.1007\/978-3-030-17081-3"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Padilha, B., Schwerz, A.L., and Roberto, R.L. (2017, January 5\u20138). WED-SQL: A Relational Framework for Design and Implementation of Process-Aware Information Systems. Proceedings of the 2017 IEEE 37th International Conference on Distributed Computing Systems Workshops (ICDCSW), Atlanta, GA, USA.","DOI":"10.1109\/ICDCSW.2017.46"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"65","DOI":"10.1049\/iet-cps.2017.0068","article-title":"Developing IoT applications: Challenges and frameworks","volume":"3","author":"Udoh","year":"2018","journal-title":"IET Cyber-Phys. Syst. Theory Appl."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Horii, S. (2020, January 21\u201326). Improved computation-communication trade-off for coded distributed computing using linear dependence of intermediate values. Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA.","DOI":"10.1109\/ISIT44484.2020.9174132"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"5496","DOI":"10.1109\/TIT.2022.3158828","article-title":"Storage-Computation-Communication Tradeoff in Distributed Computing: Fundamental Limits and Complexity","volume":"68","author":"Yan","year":"2022","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_36","unstructured":"Jangda, A., Huang, J., Liu, G., Sabet, A.H.N., Maleki, S., Miao, Y., Musuvathi, M., Mytkowicz, T., and Saarikivi, O. (March, January 28). Breaking the computation and communication abstraction barrier in distributed machine learning workloads. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland."},{"key":"ref_37","unstructured":"Hu, H., Jiang, C., Zhong, Y., Peng, Y., Wu, C., Zhu, Y., Lin, H., and Guo, C. (2022). dPRO: A Generic Profiling and Optimization System for Expediting Distributed DNN Training. arXiv."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"639","DOI":"10.1109\/TCC.2021.3108043","article-title":"Dynamic resource provisioning for iterative workloads on Apache Spark","volume":"11","author":"Cheng","year":"2021","journal-title":"IEEE Trans. Cloud Comput."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Kordelas, A., Spyrou, T., Voulgaris, S., Megalooikonomou, V., and Deligiannis, N. (2023, January 23\u201325). KORDI: A Framework for Real-Time Performance and Cost Optimization of Apache Spark Streaming. Proceedings of the 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Raleigh, NC, USA.","DOI":"10.1109\/ISPASS57527.2023.00045"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"111028","DOI":"10.1016\/j.jss.2021.111028","article-title":"Tuning configuration of apache spark on public clouds by combining multi-objective optimization and performance prediction model","volume":"180","author":"Cheng","year":"2021","journal-title":"J. Syst. Softw."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Geng, J., Li, D., Cheng, Y., Wang, S., and Li, J. (2018, January 24). HiPS: Hierarchical parameter synchronization in large-scale distributed machine learning. Proceedings of the 2018 Workshop on Network Meets AI & ML, Budapest, Hungary.","DOI":"10.1145\/3229543.3229544"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Nascimento, J.P.B., Capanema, D.O., and Pereira, A.C.M. (2017, January 18\u201320). Assessing and improving the performance and scalability of an iterative algorithm for Hadoop. Proceedings of the 2017 Computing Conference, London, UK.","DOI":"10.1109\/SAI.2017.8252224"},{"key":"ref_43","unstructured":"Sahith, C.S.K., Muppidi, S., and Merugula, S. (2023, January 20\u201321). Apache Spark Big data Analysis, Performance Tuning, and Spark Application Optimization. Proceedings of the 2023 International Conference on Evolutionary Algorithms and Soft Computing Techniques (EASCT), Bengaluru, India."},{"key":"ref_44","unstructured":"Ousterhout, K. (2017). Architecting for Performance Clarity in Data Analytics Frameworks. [Ph.D. Thesis, UC Berkeley]."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"341","DOI":"10.1111\/1467-8551.12355","article-title":"Big data and predictive analytics and manufacturing performance: Integrating institutional theory, resource-based view and big data culture","volume":"30","author":"Dubey","year":"2019","journal-title":"Br. J. Manag."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Gupta, Y.K., and Kumari, S. (2021, January 3\u20135). Performance Evaluation of Distributed Machine Learning for Cardiovascular Disease Prediction in Spark. Proceedings of the 2021 5th International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India.","DOI":"10.1109\/ICOEI51242.2021.9452955"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Assefi, M., Behravesh, E., Liu, G., and Tafti, A.P. (2017, January 11\u201314). Big data machine learning using apache spark MLlib. Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA.","DOI":"10.1109\/BigData.2017.8258338"},{"key":"ref_48","unstructured":"Atefinia, R., and Ahmadi, M. (2022). Performance evaluation of Apache Spark MLlib algorithms on an intrusion detection dataset. arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Karras, A., Karras, C., Bompotas, A., Bouras, P., Theodorakopoulos, L., and Sioutas, S. (2022, January 25\u201327). SparkReact: A Novel and User-friendly Graphical Interface for the Apache Spark MLlib Library. Proceedings of the 26th Pan-Hellenic Conference on Informatics, Athens, Greece.","DOI":"10.1145\/3575879.3575998"},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"56214","DOI":"10.1109\/ACCESS.2023.3281484","article-title":"Effective Feature Engineering Technique for Heart Disease Prediction with Machine Learning","volume":"11","author":"Qadri","year":"2023","journal-title":"IEEE Access"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Azeroual, O., and Nikiforova, A. (2022). Apache spark and mllib-based intrusion detection system or how the big data technologies can secure the data. Information, 13.","DOI":"10.3390\/info13020058"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Esmaeilzadeh, A., Heidari, M., Abdolazimi, R., Hajibabaee, P., and Malekzadeh, M. (2022, January 26\u201329). Efficient Large Scale NLP Feature Engineering with Apache Spark. Proceedings of the 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA.","DOI":"10.1109\/CCWC54503.2022.9720765"}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/18\/2\/74\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T16:25:23Z","timestamp":1760027123000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/18\/2\/74"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,1]]},"references-count":52,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2025,2]]}},"alternative-id":["a18020074"],"URL":"https:\/\/doi.org\/10.3390\/a18020074","relation":{},"ISSN":["1999-4893"],"issn-type":[{"value":"1999-4893","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,1]]}}}