{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,23]],"date-time":"2026-03-23T17:13:18Z","timestamp":1774285998234,"version":"3.50.1"},"reference-count":47,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,3,11]],"date-time":"2024-03-11T00:00:00Z","timestamp":1710115200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2024,3,31]]},"abstract":"<jats:p>\n            This article studies checkpointing strategies for parallel applications subject to failures. The optimal strategy to minimize total execution time, or makespan, is well known when failure IATs obey an Exponential distribution, but it is unknown for non-memoryless failure distributions. We explain why the latter fact is misunderstood in recent literature. We propose a general strategy that maximizes the expected efficiency until the next failure, and we show that this strategy achieves an asymptotically optimal makespan, thereby establishing the first optimality result for arbitrary failure distributions. Through extensive simulations, we show that the new strategy is always at least as good as the Young\/Daly strategy for various failure distributions. 
For distributions with high infant mortality (such as LogNormal with shape parameter\n            <jats:italic>k<\/jats:italic>\n            =2.51 or Weibull with shape parameter 0.5), the execution time is divided by a factor of 1.9 on average, and up to a factor 4.2 for recently deployed platforms.\n          <\/jats:p>","DOI":"10.1145\/3624560","type":"journal-article","created":{"date-parts":[[2023,9,22]],"date-time":"2023-09-22T12:07:06Z","timestamp":1695384426000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["Checkpointing Strategies to Tolerate Non-Memoryless Failures on HPC Platforms"],"prefix":"10.1145","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2910-3540","authenticated-orcid":false,"given":"Anne","family":"Benoit","sequence":"first","affiliation":[{"name":"Laboratoire LIP, ENS Lyon and Inria Lyon, Lyon, France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9739-7440","authenticated-orcid":false,"given":"Lucas","family":"Perotin","sequence":"additional","affiliation":[{"name":"Laboratoire LIP, ENS Lyon and Inria Lyon, Lyon, France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2361-055X","authenticated-orcid":false,"given":"Yves","family":"Robert","sequence":"additional","affiliation":[{"name":"Laboratoire LIP, ENS Lyon and Inria Lyon, Lyon, France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0663-6152","authenticated-orcid":false,"given":"Fr\u00e9d\u00e9ric","family":"Vivien","sequence":"additional","affiliation":[{"name":"Laboratoire LIP, ENS Lyon and Inria Lyon, Lyon, 
France"}]}],"member":"320","published-online":{"date-parts":[[2024,3,11]]},"reference":[{"key":"e_1_3_4_2_2","doi-asserted-by":"publisher","DOI":"10.15803\/ijnc.6.1_2"},{"key":"e_1_3_4_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER.2017.24"},{"key":"e_1_3_4_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2016.100"},{"key":"e_1_3_4_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063427"},{"key":"e_1_3_4_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2016.2643660"},{"key":"e_1_3_4_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/2897189"},{"key":"e_1_3_4_8_2","volume-title":"IC3, Proceedings of the 14th International Conference on Contemporary Computing","author":"Benoit Anne","year":"2022","unstructured":"Anne Benoit, Yishu Du, Thomas Herault, Loris Marchal, Guillaume Pallez, Lucas Perotin, Yves Robert, Hongyang Sun, and Fr\u00e9d\u00e9ric Vivien. 2022. Checkpointing \u00e0 la Young\/Daly: An overview. In IC3, Proceedings of the 14th International Conference on Contemporary Computing. ACM Press."},{"key":"e_1_3_4_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063428"},{"key":"e_1_3_4_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPADS.2016.0125"},{"issue":"1","key":"e_1_3_4_11_2","article-title":"Toward exascale resilience: 2014 update","volume":"1","author":"Cappello Franck","year":"2014","unstructured":"Franck Cappello, Al Geist, William Gropp, Sanjay Kale, Bill Kramer, and Marc Snir. 2014. Toward exascale resilience: 2014 update. Supercomputing Frontiers and Innovations 1, 1 (2014), 5\u201328.","journal-title":"Supercomputing Frontiers and Innovations"},{"key":"e_1_3_4_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/214451.214456"},{"key":"e_1_3_4_13_2","volume-title":"A Course in Probability Theory (3 ed.)","author":"Chung Kai Lai","year":"2000","unstructured":"Kai Lai Chung. 2000. A Course in Probability Theory (3 ed.). 
Stanford University."},{"key":"e_1_3_4_14_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2004.11.016"},{"key":"e_1_3_4_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2014.122"},{"key":"e_1_3_4_16_2","doi-asserted-by":"crossref","unstructured":"S. Di H. Guo R. Gupta E. R. Pershey M. Snir and F. Cappello. 2018. Exploring properties and correlations of fatal events in a large-scale HPC system. In IEEE Transactions on Parallel and Distributed Systems 30 2 (2018) 361\u2013374.","DOI":"10.1109\/TPDS.2018.2864184"},{"key":"e_1_3_4_17_2","doi-asserted-by":"crossref","unstructured":"S. Di Y. Robert F. Vivien and F. Cappello. 2016. Toward an optimal online checkpoint solution under a two-level HPC checkpoint model. In IEEE Transactions on Parallel and Distributed Systems 28 1 (2016) 244\u2013259.","DOI":"10.1109\/TPDS.2016.2546248"},{"key":"e_1_3_4_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2013.6575356"},{"key":"e_1_3_4_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER.2014.6968778"},{"key":"e_1_3_4_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063443"},{"key":"e_1_3_4_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS49936.2021.00038"},{"key":"e_1_3_4_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/MASCOTS50786.2020.9285959"},{"key":"e_1_3_4_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2018.2801300"},{"key":"e_1_3_4_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3225058.3225145"},{"key":"e_1_3_4_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063444"},{"key":"e_1_3_4_26_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-20943-2"},{"key":"e_1_3_4_27_2","doi-asserted-by":"publisher","DOI":"10.15803\/ijnc.9.1_28"},{"key":"e_1_3_4_28_2","doi-asserted-by":"publisher","DOI":"10.4236\/jsea.2013.64A006"},{"key":"e_1_3_4_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/1851476.1851509"},{"key":"e_1_3_4_30_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.spl.2006.04.041"},{
"key":"e_1_3_4_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/1807128.1807160"},{"key":"e_1_3_4_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/2909428.2909430"},{"key":"e_1_3_4_33_2","doi-asserted-by":"crossref","unstructured":"Yibei Ling Jie Mi and Xiaola Lin. 2001. A variational calculus approach to optimal checkpoint placement. In IEEE Transactions on Computers 50 7 (2001) 699\u2013708.","DOI":"10.1109\/12.936236"},{"key":"e_1_3_4_34_2","doi-asserted-by":"publisher","DOI":"10.2172\/984082"},{"key":"e_1_3_4_35_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jss.2009.06.058"},{"key":"e_1_3_4_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2006.5"},{"issue":"1","key":"e_1_3_4_37_2","article-title":"Understanding failures in petascale computers","volume":"78","author":"Schroeder B.","year":"2007","unstructured":"B. Schroeder and G. A. Gibson. 2007. Understanding failures in petascale computers. Journal of Physics: Conference Series 78, 1 (2007).","journal-title":"Journal of Physics: Conference Series"},{"key":"e_1_3_4_38_2","unstructured":"K. Schroiff P. Gemsjaeger and C. Bolik. 2006. Cascading failover of a data management application for shared disk file systems in loosely coupled node clusters. (2006). Retrieved from https:\/\/www.google.com\/patents\/US6990606. US Patent 6 990 606."},{"issue":"2","key":"e_1_3_4_39_2","first-page":"315","article-title":"Realizing best checkpointing control in computing systems","volume":"32","author":"Sigdel P.","year":"2021","unstructured":"P. Sigdel, X. Yuan, and N. Tzeng. 2021. Realizing best checkpointing control in computing systems. 
IEEE TPDS 32, 2 (2021), 315\u2013329.","journal-title":"IEEE TPDS"},{"key":"e_1_3_4_40_2","doi-asserted-by":"publisher","DOI":"10.1049\/ip-sen:19982440"},{"key":"e_1_3_4_41_2","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511755309"},{"key":"e_1_3_4_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER.2017.127"},{"issue":"5","key":"e_1_3_4_43_2","first-page":"641","article-title":"Unified fault-tolerance framework for hybrid task-parallel message-passing applications","volume":"32","author":"Subasi Omer","year":"2018","unstructured":"Omer Subasi, Tatiana Martsinkevich, Ferad Zyulkyarov, Osman Unsal, Jesus Labarta, and Franck Cappello. 2018. Unified fault-tolerance framework for hybrid task-parallel message-passing applications. IJHPCA 32, 5 (2018), 641\u2013657.","journal-title":"IJHPCA"},{"key":"e_1_3_4_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2014.101"},{"key":"e_1_3_4_45_2","doi-asserted-by":"publisher","DOI":"10.1137\/0213039"},{"key":"e_1_3_4_46_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2014.05.052"},{"key":"e_1_3_4_47_2","article-title":"Analysis and modeling of time-correlated failures in large-scale distributed systems","author":"Yigitbasi Nezih","year":"2010","unstructured":"Nezih Yigitbasi, Matthieu Gallet, Derrick Kondo, Alexandru Iosup, and Dick Epema. 2010. Analysis and modeling of time-correlated failures in large-scale distributed systems. 
Parallel and Distributed Systems Report Series (2010).","journal-title":"Parallel and Distributed Systems Report Series"},{"key":"e_1_3_4_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/361147.361115"}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3624560","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3624560","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T17:49:46Z","timestamp":1750268986000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3624560"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,11]]},"references-count":47,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,3,31]]}},"alternative-id":["10.1145\/3624560"],"URL":"https:\/\/doi.org\/10.1145\/3624560","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"value":"2329-4949","type":"print"},{"value":"2329-4957","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,11]]},"assertion":[{"value":"2022-12-27","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-08-10","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-03-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}