{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,31]],"date-time":"2025-12-31T00:39:09Z","timestamp":1767141549388,"version":"build-2238731810"},"reference-count":38,"publisher":"Springer Science and Business Media LLC","issue":"13","license":[{"start":{"date-parts":[[2022,4,17]],"date-time":"2022-04-17T00:00:00Z","timestamp":1650153600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,4,17]],"date-time":"2022-04-17T00:00:00Z","timestamp":1650153600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100011033","name":"Agencia Estatal de Investigaci\u00f3n","doi-asserted-by":"publisher","award":["PID2020-112496GB-I00"],"award-info":[{"award-number":["PID2020-112496GB-I00"]}],"id":[{"id":"10.13039\/501100011033","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100011104","name":"Universitat Aut\u00f2noma de Barcelona","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100011104","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Supercomput"],"published-print":{"date-parts":[[2022,9]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Due to the increase and complexity of computer systems, reducing the overhead of fault tolerance techniques has become important in recent years. One technique in fault tolerance is checkpointing, which saves a snapshot with the information that has been computed up to a specific moment, suspending the execution of the application, consuming I\/O resources and network bandwidth. Characterizing the files that are generated when performing the checkpoint of a parallel application is useful to determine the resources consumed and their impact on the I\/O system. 
It is also important to characterize the application that performs checkpoints, and one of these characteristics is whether the application does I\/O. In this paper, we present a model of checkpoint behavior for parallel applications that perform I\/O; this behavior depends on the application and on other factors such as the number of processes, the mapping of processes and the type of I\/O used. These characteristics will also influence scalability, the resources consumed and their impact on the I\/O system. Our model describes the behavior of the checkpoint size based on the characteristics of the system and the type (or model) of I\/O used, such as the number of I\/O aggregator processes, the buffering size utilized by the two-phase I\/O optimization technique and components of collective file I\/O operations. The BT benchmark and FLASH I\/O are analyzed under different configurations of aggregator processes and buffer size to explain our approach. The model can be useful when selecting which type of checkpoint configuration is most appropriate according to the applications\u2019 characteristics and resources available. 
Thus, the user will be able to know how much storage space the checkpoint consumes and how much the application consumes, in order to establish policies that help improve the distribution of resources.<\/jats:p>","DOI":"10.1007\/s11227-022-04482-8","type":"journal-article","created":{"date-parts":[[2022,4,17]],"date-time":"2022-04-17T05:02:51Z","timestamp":1650171771000},"page":"15404-15436","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["A model of checkpoint behavior for applications that have I\/O"],"prefix":"10.1007","volume":"78","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1778-0237","authenticated-orcid":false,"given":"Betzabeth","family":"Le\u00f3n","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5793-1928","authenticated-orcid":false,"given":"Sandra","family":"M\u00e9ndez","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0002-7046","authenticated-orcid":false,"given":"Daniel","family":"Franco","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5500-850X","authenticated-orcid":false,"given":"Dolores","family":"Rexachs","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2884-3232","authenticated-orcid":false,"given":"Emilio","family":"Luque","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2022,4,17]]},"reference":[{"key":"4482_CR1","doi-asserted-by":"publisher","unstructured":"Ouyang X, Gopalakrishnan K, Gangadharappa T, Panda DK (2009) Fast checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on multicore architecture. 
In 2009 International Conference on High Performance Computing (HiPC), pp 99\u2013108. https:\/\/doi.org\/10.1109\/HIPC.2009.5433218","DOI":"10.1109\/HIPC.2009.5433218"},{"key":"4482_CR2","doi-asserted-by":"crossref","unstructured":"Leon B, Gomez P, Franco D, Rexachs D, Luque E (2020) Analysis of Checkpoint I\/O behavior. In: International Conference on Computational Science (ICCS), S. N. S. A. 2020, Ed., ser. Lecture Notes in Computer Science, vol. 12137, Springer Nature Switzerland AG, pp 191\u2013205","DOI":"10.1007\/978-3-030-50371-0_14"},{"issue":"2","key":"4482_CR3","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3152891","volume":"51","author":"FZ Boito","year":"2018","unstructured":"Boito FZ, Inacio EC, Bez JL, Navaux PO, Dantas MA, Denneulin Y (2018) A checkpoint of research on parallel I\/O for high-performance computing. ACM Comput Surv (CSUR) 51(2):1\u201335","journal-title":"ACM Comput Surv (CSUR)"},{"issue":"3","key":"4482_CR4","first-page":"63","volume":"5","author":"DH Bailey","year":"1991","unstructured":"Bailey DH, Barszcz E, Barton JT et al (1991) The NAS parallel benchmarks. The Int J Supercomput Appl 5(3):63\u201373","journal-title":"The Int J Supercomput Appl"},{"key":"4482_CR5","unstructured":"The HDF Group. Hierarchical Data Format, version 5. (1997-2018), [Online]. Available: http:\/\/www.hdfgroup.org\/HDF5\/"},{"key":"4482_CR6","doi-asserted-by":"publisher","unstructured":"Li J, Liao W-k, Choudhary A, et al. (2003) Parallel netCDF: a high - performance scientific I\/O interface. In: Supercomputing, 2003 ACM\/IEEE Conference, Nov. 2003, pp 39\u201339. https:\/\/doi.org\/10.1109\/SC.2003.10053","DOI":"10.1109\/SC.2003.10053"},{"key":"4482_CR7","unstructured":"Unidata. Network Common Data Form (netCDF) (2018) [Online]. Available: http:\/\/doi.org\/10.5065\/D6H70CW6"},{"key":"4482_CR8","doi-asserted-by":"publisher","unstructured":"Kang Q, Ross R, Latham R, et al. 
(2020) Improving all-to-many personalized communication in two-phase I\/O. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp 1\u201313. https:\/\/doi.org\/10.1109\/SC41405.2020.00014","DOI":"10.1109\/SC41405.2020.00014"},{"key":"4482_CR9","doi-asserted-by":"crossref","unstructured":"Thakur R, Gropp W, Lusk E (1999) Data sieving and collective I\/O in ROMIO. In: Proceedings of the The 7th Symposium on the Frontiers of Massively Parallel Computation, ser. FRONTIERS \u201999, Washington, DC, USA: IEEE Computer Society, pp 182\u2013, isbn: 0-7695-0087-0. [Online]. Available: http:\/\/dl.acm.org\/citation.cfm?id=795668. 796733","DOI":"10.1109\/FMPC.1999.750599"},{"key":"4482_CR10","doi-asserted-by":"publisher","first-page":"312","DOI":"10.1109\/CLUSTER.2010.36","volume":"2010","author":"K Ohta","year":"2010","unstructured":"Ohta K, Kimpe D, Cope J, Iskra K, Ross R, Ishikawa Y (2010) Optimization techniques at the I\/O forwarding layer. IEEE Int Conf Clust Comput 2010:312\u2013321. https:\/\/doi.org\/10.1109\/CLUSTER.2010.36","journal-title":"IEEE Int Conf Clust Comput"},{"issue":"1","key":"4482_CR11","doi-asserted-by":"publisher","first-page":"361","DOI":"10.1007\/s11227-010-0440-0","volume":"59","author":"R Filgueira","year":"2012","unstructured":"Filgueira R, Carretero J, Singh DE, Calderon A, N\u00fa\u00f1ez A (2012) Dynamic - CoMPI: dynamic optimization techniques for MPI parallel applications. J Supercomput 59(1):361\u2013391","journal-title":"J Supercomput"},{"key":"4482_CR12","unstructured":"Thakur R, Ross R, Lusk E, Gropp W, Latham R (2010) Users guide for ROMIO: a high-performance. portable MPI-IO implementation. [Online]. Available: https:\/\/www.mcs.anl.gov\/projects\/romio"},{"key":"4482_CR13","unstructured":"Project TOM (2021) Tuning the OMPIO parallel I\/O component, [Online]. 
Available: http:\/\/www.open-mpi.org\/faq\/?category=ompio#how-can-i-use-omio"},{"key":"4482_CR14","doi-asserted-by":"crossref","unstructured":"Elliott J, Kharbas K, Fiala D, Mueller F, Ferreira K, Engelmann C (2012) Combining partial redundancy and checkpointing for HPC. In: 2012 IEEE 32nd International Conference on Distributed Computing Systems, pp 615\u2013626","DOI":"10.1109\/ICDCS.2012.56"},{"key":"4482_CR15","doi-asserted-by":"publisher","unstructured":"Akber, S\u00a0Muhammad\u00a0Abrar, Chen H, Wang Y, Jin H (2018) Minimizing overheads of checkpoints in distributed stream processing systems. In: 2018 IEEE 7th International Conference on Cloud Networking (CloudNet), pp 1\u20134. https:\/\/doi.org\/10.1109\/CloudNet.2018.8549548","DOI":"10.1109\/CloudNet.2018.8549548"},{"key":"4482_CR16","doi-asserted-by":"publisher","unstructured":"Coti C, Herault T, Lemarinier P, et al. (2006) Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI. In: SC \u201906: Proceedings of the 2006 ACM\/IEEE Conference on Supercomputing, pp 1\u201318. https:\/\/doi.org\/10.1109\/SC.2006.15","DOI":"10.1109\/SC.2006.15"},{"key":"4482_CR17","doi-asserted-by":"publisher","unstructured":"Estahbanati M\u00a0Gholami, Schintke F (2019) Multilevel checkpoint\/restart for large computational jobs on distributed computing resources. In: 2019 38th Symposium on Reliable Distributed Systems (SRDS), pp 143\u2013 149. https:\/\/doi.org\/10.1109\/SRDS47363.2019.00025","DOI":"10.1109\/SRDS47363.2019.00025"},{"key":"4482_CR18","doi-asserted-by":"crossref","unstructured":"Ansel J, Arya K, Cooperman G (2009) DMTCP: transparent checkpointing for cluster computations and the desktop. In: IEEE International Symposium on Parallel & Distributed Processing. IEEE 2009:1\u201312","DOI":"10.1109\/IPDPS.2009.5161063"},{"key":"4482_CR19","unstructured":"Wong P, Van\u00a0der\u00a0Wijngaart RF (2003) NAS parallel benchmarks I\/O version 2.4. NAS Technical Report NAS-03-00. [Online]. 
Available: https:\/\/www.nas.nasa.gov\/assets\/pdf\/techreports\/2003\/nas-03- 002.pdf"},{"key":"4482_CR20","first-page":"36","volume":"3","author":"M Kumar","year":"2014","unstructured":"Kumar M, Choudhary A, Kumar V (2014) A comparison between different checkpoint schemes with advantages and disadvantages. Int J Comput Appl Nat Semin Recent Adv Wirel Netw Commun 3:36\u201339","journal-title":"Int J Comput Appl Nat Semin Recent Adv Wirel Netw Commun"},{"key":"4482_CR21","first-page":"783","volume":"2018","author":"D Dauwe","year":"2018","unstructured":"Dauwe D, Pasricha S, Maciejewski AA, Siegel HJ (2018) An analysis of multilevel checkpoint performance models. IEEE Int Parallel Distrib Process Symp Workshops (IPDPSW) 2018:783\u2013792","journal-title":"IEEE Int Parallel Distrib Process Symp Workshops (IPDPSW)"},{"key":"4482_CR22","doi-asserted-by":"crossref","unstructured":"Losada N, Mart\u00edn MJ, Rodr\u00edguez G, Gozn\u00e1lez P (2015) I\/O optimization in the checkpointing of openMP parallel applications. In: 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pp 222\u2013229","DOI":"10.1109\/PDP.2015.39"},{"key":"4482_CR23","doi-asserted-by":"publisher","unstructured":"Wang N, Sun Q, Liu Y, Qian D (2018) Mitigating I\/O impact of checkpointing on large scale parallel systems. In: 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC\/SmartCity\/DSS), pp 117\u2013123. https:\/\/doi.org\/10.1109\/HPCC\/SmartCity\/DSS.2018.00047","DOI":"10.1109\/HPCC\/SmartCity\/DSS.2018.00047"},{"key":"4482_CR24","doi-asserted-by":"publisher","unstructured":"Qian Y, Yi R, Du Y, Xiao N, Jin S (2013) Dynamic i\/o congestion control in scalable lustre file system. In: 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST), May 2013, pp 1\u20135. 
https:\/\/doi.org\/10.1109\/MSST.2013.6558432","DOI":"10.1109\/MSST.2013.6558432"},{"key":"4482_CR25","first-page":"93","volume":"2014","author":"N El-Sayed","year":"2014","unstructured":"El-Sayed N, Schroeder B (2014) To checkpoint or not to checkpoint: understanding energy-performance-i\/o tradeoffs in hpc checkpointing. IEEE Int Conf Cluster Comput (CLUSTER) 2014:93\u2013102","journal-title":"IEEE Int Conf Cluster Comput (CLUSTER)"},{"issue":"3","key":"4482_CR26","doi-asserted-by":"publisher","first-page":"163","DOI":"10.1007\/s00354-013-0302-4","volume":"31","author":"I Cores","year":"2013","unstructured":"Cores I, Rodr\u0131guez G, Gonz\u00e1lez P, Osorio RR et al (2013) Improving scalability of application-level checkpoint-recovery by reducing checkpoint sizes. New Gener Comput 31(3):163\u2013185","journal-title":"New Gener Comput"},{"issue":"4","key":"4482_CR27","first-page":"199","volume":"4","author":"A Kongmunvattana","year":"2015","unstructured":"Kongmunvattana A (2015) Reducing checkpoint creation overhead using data similarity. Int J Comput 4(4):199\u2013206","journal-title":"Int J Comput"},{"key":"4482_CR28","doi-asserted-by":"crossref","unstructured":"Rusu C, Grecu C, Anghel L (2008) Improving the scalability of checkpoint recovery for networks-on-chip. In: 2008 IEEE International Symposium on Circuits and Systems, May 2008, pp 2793\u20132796","DOI":"10.1109\/ISCAS.2008.4542037"},{"key":"4482_CR29","doi-asserted-by":"crossref","unstructured":"Chaarawi M, Gabriel E (2011) Automatically selecting the number of aggregators for collective I\/O operations. 
In: 2011 IEEE International Conference on Cluster Computing, IEEE, 2011, pp 428\u2013437","DOI":"10.1109\/CLUSTER.2011.79"},{"issue":"11","key":"4482_CR30","doi-asserted-by":"publisher","first-page":"2682","DOI":"10.1109\/TPDS.2020.3000458","volume":"31","author":"Q Kang","year":"2020","unstructured":"Kang Q, Lee S, Hou K et al (2020) Improving MPI collective I\/O for high volume non-contiguous requests with intra-node aggregation. IEEE Trans Parallel Distrib Syst 31(11):2682\u20132695. https:\/\/doi.org\/10.1109\/TPDS.2020.3000458","journal-title":"IEEE Trans Parallel Distrib Syst"},{"key":"4482_CR31","doi-asserted-by":"publisher","first-page":"120","DOI":"10.1109\/CLUSTER.2016.37","volume":"2016","author":"G Congiu","year":"2016","unstructured":"Congiu G, Narasimhamurthy S, S\u00fc\u00df T, Brinkmann A (2016) Improving collective I\/O performance using non-volatile memory devices. IEEE Int Conf Cluster Comput (CLUSTER) 2016:120\u2013129. https:\/\/doi.org\/10.1109\/CLUSTER.2016.37","journal-title":"IEEE Int Conf Cluster Comput (CLUSTER)"},{"key":"4482_CR32","doi-asserted-by":"publisher","unstructured":"Bagbaba A (2021) A comparative study of MPI-IO libraries for offloading of collective I\/O tasks. In: 2021 International Conference on Engineering and Emerging Technologies (ICEET), pp 1\u20136. https:\/\/doi.org\/10.1109\/ICEET53442.2021.9659767","DOI":"10.1109\/ICEET53442.2021.9659767"},{"key":"4482_CR33","unstructured":"M\u00e9ndez S, Rexachs D, Luque E (2012) Evaluating utilization of the I\/O system on computer clusters. 
In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), The Steering Committee of The World Congress in Computer Science, 2012, pp 1\u20137"},{"key":"4482_CR34","doi-asserted-by":"crossref","unstructured":"Le\u00f3n B, Franco D, Rexachs D, Luque E (2020) Analysis of parallel application checkpoint storage for system configuration, J Supercomput, 1\u201336","DOI":"10.1007\/s11227-020-03445-1"},{"key":"4482_CR35","doi-asserted-by":"crossref","unstructured":"Thakur R, Gropp W, Lusk E (1999) Data sieving and collective I\/O in ROMIO. In: Proceedings. Frontiers \u201999. Seventh Symposium on the Frontiers of Massively Parallel Computation, IEEE, pp 182\u2013189","DOI":"10.1109\/FMPC.1999.750599"},{"key":"4482_CR36","unstructured":"A. Laboratory, Flash IO Benchmark, Tech. Rep., (2013) [Online]. Available: http:\/\/www.mcs.anl.gov\/research\/projects\/pio-benchmark\/"},{"key":"4482_CR37","doi-asserted-by":"publisher","unstructured":"Fineberg S, Wong P, Nitzberg B, Kuszmaul C (1996) PMPIO-a portable implementation of MPI-IO. In: Proceedings of 6th Symposium on the Frontiers of Massively Parallel Computation (Frontiers \u201996), pp 188\u2013195. https:\/\/doi.org\/10.1109\/FMPC.1996.558082","DOI":"10.1109\/FMPC.1996.558082"},{"key":"4482_CR38","doi-asserted-by":"publisher","unstructured":"Shan H, Antypas K, Shalf J (2008) Characterizing and predicting the I\/O performance of HPC applications using a parameterized synthetic benchmark. In: SC \u201908: Proceedings of the 2008 ACM\/IEEE Conference on Supercomputing, pp 1\u201312. 
https:\/\/doi.org\/10.1109\/SC.2008.5222721","DOI":"10.1109\/SC.2008.5222721"}],"updated-by":[{"DOI":"10.1007\/s11227-022-04571-8","type":"correction","label":"Correction","source":"publisher","updated":{"date-parts":[[2022,5,9]],"date-time":"2022-05-09T00:00:00Z","timestamp":1652054400000}}],"container-title":["The Journal of Supercomputing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-022-04482-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11227-022-04482-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-022-04482-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,8,8]],"date-time":"2022-08-08T11:13:50Z","timestamp":1659957230000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11227-022-04482-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,4,17]]},"references-count":38,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2022,9]]}},"alternative-id":["4482"],"URL":"https:\/\/doi.org\/10.1007\/s11227-022-04482-8","relation":{},"ISSN":["0920-8542","1573-0484"],"issn-type":[{"value":"0920-8542","type":"print"},{"value":"1573-0484","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,4,17]]},"assertion":[{"value":"23 March 2022","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 April 2022","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 May 2022","order":3,"name":"change_date","label":"Change Date","group":{"name":"ArticleHistory","label":"Article 
History"}},{"value":"Correction","order":4,"name":"change_type","label":"Change Type","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"A Correction to this paper has been published:","order":5,"name":"change_details","label":"Change Details","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"https:\/\/doi.org\/10.1007\/s11227-022-04571-8","URL":"https:\/\/doi.org\/10.1007\/s11227-022-04571-8","order":6,"name":"change_details","label":"Change Details","group":{"name":"ArticleHistory","label":"Article History"}}]}}