{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,30]],"date-time":"2026-06-30T01:17:02Z","timestamp":1782782222758,"version":"3.54.5"},"reference-count":34,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,1,16]],"date-time":"2024-01-16T00:00:00Z","timestamp":1705363200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,1,16]],"date-time":"2024-01-16T00:00:00Z","timestamp":1705363200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The extensive use of HPC infrastructures and frameworks for running data-intensive applications has led to a growing interest in data partitioning techniques and strategies. In fact, application performance can be heavily affected by how data are partitioned, which in turn depends on the selected size for data blocks, i.e. the block size. Therefore, finding an effective partitioning, i.e. a suitable block size, is a key strategy to speed-up parallel data-intensive applications and increase scalability. This paper describes a methodology, namely BLEST-ML (BLock size ESTimation through Machine Learning), for block size estimation that relies on supervised machine learning techniques. The proposed methodology was evaluated by designing an implementation tailored to <jats:italic>dislib<\/jats:italic>, a distributed computing library highly focused on machine learning algorithms built on top of the PyCOMPSs framework. 
We assessed the effectiveness of the provided implementation through an extensive experimental evaluation considering different algorithms from dislib, datasets, and infrastructures, including the MareNostrum 4 supercomputer. The results we obtained show the ability of BLEST-ML to efficiently determine a suitable way to split a given dataset, thus providing a proof of its applicability to enable the efficient execution of data-parallel applications in high performance environments.<\/jats:p>","DOI":"10.1186\/s40537-023-00862-w","type":"journal-article","created":{"date-parts":[[2024,1,16]],"date-time":"2024-01-16T18:01:28Z","timestamp":1705428088000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["Block size estimation for data partitioning in HPC applications using machine learning techniques"],"prefix":"10.1186","volume":"11","author":[{"given":"Riccardo","family":"Cantini","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Fabrizio","family":"Marozzo","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Alessio","family":"Orsino","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Domenico","family":"Talia","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Paolo","family":"Trunfio","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Rosa 
M.","family":"Badia","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jorge","family":"Ejarque","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Fernando","family":"V\u00e1zquez-Novoa","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2024,1,16]]},"reference":[{"issue":"1","key":"862_CR1","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-019-0253-9","volume":"6","author":"A Gandomi","year":"2019","unstructured":"Gandomi A, Reshadi M, Movaghar A, Khademzadeh A. Hybsmrp: a hybrid scheduling algorithm in Hadoop Mapreduce framework. J Big Data. 2019;6(1):1\u201316.","journal-title":"J Big Data"},{"key":"862_CR2","doi-asserted-by":"crossref","unstructured":"Carver B, Zhang J, Wang A, Anwar A, Wu P, Cheng Y. Wukong: a scalable and locality-enhanced framework for serverless parallel computing. In: Proceedings of the 11th ACM Symposium on Cloud Computing, 2020; 1\u2013 15.","DOI":"10.1145\/3419111.3421286"},{"key":"862_CR3","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.4229","volume":"29","author":"F Marozzo","year":"2017","unstructured":"Marozzo F, Rodrigo Duro F, Garcia Blas J, Carretero J, Talia D, Trunfio P. A data-aware scheduling strategy for workflow execution in clouds. Concurr Comput Pract Exp. 2017;29: e4229.","journal-title":"Concurr Comput Pract Exp"},{"key":"862_CR4","doi-asserted-by":"publisher","first-page":"47354","DOI":"10.1109\/ACCESS.2021.3067815","volume":"9","author":"S Giamp\u00e0","year":"2021","unstructured":"Giamp\u00e0 S, Belcastro L, Marozzo F, Talia D, Trunfio P. A data-aware scheduling strategy for executing large-scale distributed workflows. IEEE Access. 2021;9:47354\u201364.","journal-title":"IEEE Access"},{"key":"862_CR5","unstructured":"Apache Spark. 
https:\/\/spark.apache.org\/."},{"key":"862_CR6","unstructured":"Apache Hadoop. https:\/\/hadoop.apache.org\/."},{"key":"862_CR7","doi-asserted-by":"crossref","unstructured":"Ansel J, Kamil S, Veeramachaneni K, Ragan-Kelley J, Bosboom J, O\u2019Reilly U-M, Amarasinghe S. Opentuner: an extensible framework for program autotuning. In: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, 2014: 303\u2013 316.","DOI":"10.1145\/2628071.2628092"},{"key":"862_CR8","unstructured":"Barcelona Supercomputing Center (BSC): MareNostrum IV Technical Information. https:\/\/www.bsc.es\/marenostrum\/marenostrum\/technical-information."},{"issue":"2","key":"862_CR9","doi-asserted-by":"publisher","first-page":"85","DOI":"10.26599\/BDMA.2019.9020015","volume":"3","author":"MS Mahmud","year":"2020","unstructured":"Mahmud MS, Huang JZ, Salloum S, Emara TZ, Sadatdiynov K. A survey of data partitioning and sampling methods to support big data analysis. Big Data Mining Anal. 2020;3(2):85\u2013101.","journal-title":"Big Data Mining Anal"},{"issue":"2","key":"862_CR10","doi-asserted-by":"publisher","first-page":"134","DOI":"10.1002\/int.21833","volume":"32","author":"S Ram\u00edrez-Gallego","year":"2017","unstructured":"Ram\u00edrez-Gallego S, Lastra I, Mart\u00ednez-Rego D, Bol\u00f3n-Canedo V, Ben\u00edtez JM, Herrera F, Alonso-Betanzos A. Fast-mrmr: fast minimum redundancy maximum relevance algorithm for high-dimensional big data. Int J Intell Syst. 2017;32(2):134\u201352.","journal-title":"Int J Intell Syst"},{"key":"862_CR11","doi-asserted-by":"publisher","first-page":"287","DOI":"10.1016\/j.ins.2018.10.052","volume":"496","author":"R-J Palma-Mendoza","year":"2019","unstructured":"Palma-Mendoza R-J, de Marcos L, Rodriguez D, Alonso-Betanzos A. Distributed correlation-based feature selection in spark. Inf Sci. 
2019;496:287\u201399.","journal-title":"Inf Sci"},{"issue":"13","key":"862_CR12","doi-asserted-by":"publisher","first-page":"1353","DOI":"10.14778\/3007263.3007273","volume":"9","author":"M Al-Kateb","year":"2016","unstructured":"Al-Kateb M, Sinclair P, Au G, Ballinger C. Hybrid row-column partitioning in Teradata\u00ae. Proc VLDB Endow. 2016;9(13):1353\u201364.","journal-title":"Proc VLDB Endow"},{"key":"862_CR13","doi-asserted-by":"crossref","unstructured":"Schreiner GA, Duarte D, Dal\u00a0Bianco G, Mello RdS. A hybrid partitioning strategy for newsql databases: The voltdb case. iiWAS2019, pp. 353\u2013360. Association for Computing Machinery, New York, NY, USA 2019;","DOI":"10.1145\/3366030.3366062"},{"issue":"1","key":"862_CR14","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-019-0205-4","volume":"6","author":"S Salloum","year":"2019","unstructured":"Salloum S, Huang JZ, He Y. Exploring and cleaning big data with random sample data blocks. J Big Data. 2019;6(1):1\u201328.","journal-title":"J Big Data"},{"issue":"11","key":"862_CR15","doi-asserted-by":"publisher","first-page":"5846","DOI":"10.1109\/TII.2019.2912723","volume":"15","author":"S Salloum","year":"2019","unstructured":"Salloum S, Huang JZ, He Y. Random sample partition: a distributed data model for big data analysis. IEEE Trans Industr Inf. 2019;15(11):5846\u201354.","journal-title":"IEEE Trans Industr Inf"},{"key":"862_CR16","doi-asserted-by":"crossref","unstructured":"Wei C, Salloum S, Emara TZ, Zhang X, Huang JZ, He Y. A two-stage data processing algorithm to generate random sample partitions for big data analysis. In: International Conference on Cloud Computing. Springer 2018;347\u2013 364","DOI":"10.1007\/978-3-319-94295-7_24"},{"issue":"1","key":"862_CR17","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-021-00410-4","volume":"8","author":"S Migliorini","year":"2021","unstructured":"Migliorini S, Belussi A, Quintarelli E, Carra D. 
Copart: a context-based partitioning technique for big data. J Big Data. 2021;8(1):1\u201328.","journal-title":"J Big Data"},{"key":"862_CR18","unstructured":"Bertolucci M, Carlini E, Dazzi P, Lulli A, Ricci L. Static and dynamic big data partitioning on apache spark. In: Parallel Computing: On the Road to Exascale, Proceedings of the International Conference on Parallel Computing, 2015;489\u2013498."},{"key":"862_CR19","unstructured":"Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauly M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets: a Fault-Tolerant abstraction for In-Memory cluster computing. In: 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), 2012; 15\u201328."},{"issue":"2","key":"862_CR20","doi-asserted-by":"publisher","first-page":"154","DOI":"10.1145\/3296957.3173206","volume":"53","author":"S Wang","year":"2018","unstructured":"Wang S, Li C, Hoffmann H, Lu S, Sentosa W, Kistijantoro AI. Understanding and auto-adjusting performance-sensitive configurations. Acm Sigplan Notices. 2018;53(2):154\u201368.","journal-title":"Acm Sigplan Notices"},{"key":"862_CR21","doi-asserted-by":"crossref","unstructured":"Cantini R, Marozzo F, Orsino A, Talia D, Trunfio P. Exploiting machine learning for improving in-memory execution of data-intensive workflows on parallel machines. Future Internet. 2021; 13(5).","DOI":"10.3390\/fi13050121"},{"issue":"8","key":"862_CR22","doi-asserted-by":"publisher","first-page":"7989","DOI":"10.1007\/s11227-020-03612-4","volume":"77","author":"O Sukhoroslov","year":"2021","unstructured":"Sukhoroslov O. Toward efficient execution of data-intensive workflows. J Supercomput. 2021;77(8):7989\u20138012.","journal-title":"J Supercomput"},{"key":"862_CR23","doi-asserted-by":"crossref","unstructured":"Belcastro L, Cantini R, Marozzo F, Orsino A, Talia D, Trunfio P. Programming big data analysis: principles and solutions. 
J Big Data 2022; 9(4).","DOI":"10.1186\/s40537-021-00555-2"},{"key":"862_CR24","doi-asserted-by":"crossref","unstructured":"Marozzo F, Talia D, Trunfio P. Scalable script-based data analysis workflows on clouds. In: Proceedings of the 8th Workshop on Workflows in Support of Large-scale Science, 2013; 124\u2013133.","DOI":"10.1145\/2534248.2534261"},{"issue":"1","key":"862_CR25","doi-asserted-by":"publisher","first-page":"66","DOI":"10.1177\/1094342015594678","volume":"31","author":"E Tejedor","year":"2017","unstructured":"Tejedor E, Becerra Y, Alomar G, Queralt A, Badia RM, Torres J, Cortes T, Labarta J. Pycompss. Int J High Perform Comput Appl. 2017;31(1):66\u201382.","journal-title":"Int J High Perform Comput Appl"},{"key":"862_CR26","doi-asserted-by":"crossref","unstructured":"\u00c1lvarez Cid-Fuentes J, Sol\u00e0 S, \u00c1lvarez P, Castro-Ginard A, Badia RM. dislib: large Scale High Performance Machine Learning in Python. In: Proceedings of the 15th International Conference on eScience, 2019; 96\u2013105.","DOI":"10.1109\/eScience.2019.00018"},{"issue":"1","key":"862_CR27","doi-asserted-by":"publisher","first-page":"67","DOI":"10.1007\/s10723-013-9272-5","volume":"12","author":"F Lordan","year":"2014","unstructured":"Lordan F, et al. Services: an interoperable programming framework for the cloud. J Grid Comput. 2014;12(1):67\u201391.","journal-title":"J Grid Comput"},{"key":"862_CR28","doi-asserted-by":"crossref","unstructured":"Baldi P, Cranmer K, Faucett T, Sadowski P, Whiteson D. Parameterized machine learning for high-energy physics. arXiv preprint 2016; arXiv:1601.07913.","DOI":"10.1140\/epjc\/s10052-016-4099-4"},{"issue":"11","key":"862_CR29","doi-asserted-by":"publisher","first-page":"2278","DOI":"10.1109\/5.726791","volume":"86","author":"Y LeCun","year":"1998","unstructured":"LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 
1998;86(11):2278\u2013324.","journal-title":"Proc IEEE"},{"key":"862_CR30","doi-asserted-by":"crossref","unstructured":"Mariani G, Anghel A, Jongerius R, Dittmann G. Scaling application properties to exascale. In: Proceedings of the 12th ACM International Conference on Computing Frontiers, 2015: 1\u20138.","DOI":"10.1145\/2742854.2742860"},{"key":"862_CR31","unstructured":"Schmuck F, Haskin R. GPFS: A Shared-Disk file system for large computing clusters. In: Conference on File and Storage Technologies (FAST 02) 2002."},{"key":"862_CR32","doi-asserted-by":"publisher","first-page":"346","DOI":"10.1016\/j.commatsci.2018.07.052","volume":"154","author":"K Hamidieh","year":"2018","unstructured":"Hamidieh K. A data-driven statistical model for predicting the critical temperature of a superconductor. Comput Mater Sci. 2018;154:346\u201354.","journal-title":"Comput Mater Sci"},{"issue":"19","key":"862_CR33","doi-asserted-by":"publisher","first-page":"4342","DOI":"10.3390\/s19194342","volume":"19","author":"GS Sampaio","year":"2019","unstructured":"Sampaio GS, de Aguiar Vallim Filho AR, da Silva LS, da Silva LA. Prediction of motor failure time using an artificial neural network. Sensors. 2019;19(19):4342.","journal-title":"Sensors"},{"key":"862_CR34","doi-asserted-by":"publisher","first-page":"19","DOI":"10.1016\/j.softx.2015.06.001","volume":"1\u20132","author":"MJ Abraham","year":"2015","unstructured":"Abraham MJ, Murtola T, Schulz R, P\u00e1ll S, Smith JC, Hess B, Lindahl E. Gromacs: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX. 
2015;1\u20132:19\u201325.","journal-title":"SoftwareX"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-023-00862-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-023-00862-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-023-00862-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,1,16]],"date-time":"2024-01-16T18:04:15Z","timestamp":1705428255000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-023-00862-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,16]]},"references-count":34,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["862"],"URL":"https:\/\/doi.org\/10.1186\/s40537-023-00862-w","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,16]]},"assertion":[{"value":"16 January 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 December 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 January 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not 
applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"19"}}