{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:13:27Z","timestamp":1750220007450,"version":"3.41.0"},"reference-count":38,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2022,12,16]],"date-time":"2022-12-16T00:00:00Z","timestamp":1671148800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2022,12,31]]},"abstract":"<jats:p>This article revisits checkpointing strategies when workflows composed of multiple tasks execute on a parallel platform. The objective is to minimize the expectation of the total execution time. For a single task, the Young\/Daly formula provides the optimal checkpointing period. However, when many tasks execute simultaneously, the risk that one of them is severely delayed increases with the number of tasks. To mitigate this risk, a possibility is to checkpoint each task more often than with the Young\/Daly strategy. But is it worth slowing each task down with extra checkpoints? Does the extra checkpointing make a difference globally? This article answers these questions. On the theoretical side, we prove several negative results for keeping the Young\/Daly period when many tasks execute concurrently, and we design novel checkpointing strategies that guarantee an efficient execution with high probability. On the practical side, we report comprehensive experiments that demonstrate the need to go beyond the Young\/Daly period and to checkpoint more often for a wide range of application\/platform settings.<\/jats:p>","DOI":"10.1145\/3548607","type":"journal-article","created":{"date-parts":[[2022,9,2]],"date-time":"2022-09-02T11:19:16Z","timestamp":1662117556000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Checkpointing Workflows \u00e0 la Young\/Daly Is Not Good Enough"],"prefix":"10.1145","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2910-3540","authenticated-orcid":false,"given":"Anne","family":"Benoit","sequence":"first","affiliation":[{"name":"Laboratoire LIP, ENS Lyon, Lyon Cedex 07, France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9739-7440","authenticated-orcid":false,"given":"Luca","family":"Perotin","sequence":"additional","affiliation":[{"name":"Laboratoire LIP, ENS Lyon, Lyon Cedex 07, France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2361-055X","authenticated-orcid":false,"given":"Yves","family":"Robert","sequence":"additional","affiliation":[{"name":"Laboratoire LIP, ENS Lyon, Lyon Cedex 07, France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4379-4467","authenticated-orcid":false,"given":"Hongyang","family":"Sun","sequence":"additional","affiliation":[{"name":"University of Kansas, KS, USA"}]}],"member":"320","published-online":{"date-parts":[[2022,12,16]]},"reference":[{"unstructured":"Anne Benoit Lucas Perotin Yves Robert and Hongyang Sun. 2021. Checkpointing Workflows \u00e0 la Young\/Daly Is Not Good Enough: Code for In-house Simulator. (June2021). https:\/\/graal.ens-lyon.fr\/yrobert\/simulator.zip.","key":"e_1_3_3_2_2"},{"unstructured":"Argonne Leadership Computing Facility (ALCF). Mira Log Traces. Retrieved from https:\/\/reports.alcf.anl.gov\/data\/mira.html.","key":"e_1_3_3_3_2"},{"doi-asserted-by":"publisher","key":"e_1_3_3_4_2","DOI":"10.1016\/j.future.2017.05.041"},{"issue":"1","key":"e_1_3_3_5_2","first-page":"2","article-title":"Scheduling computational workflows on failure-prone platforms","volume":"6","author":"Aupy Guillaume","year":"2016","unstructured":"Guillaume Aupy, Anne Benoit, Henri Casanova, and Yves Robert. 2016. Scheduling computational workflows on failure-prone platforms. Int. J. Netw. Comput. 6, 1 (2016), 2\u201326.","journal-title":"Int. J. Netw. Comput."},{"doi-asserted-by":"publisher","key":"e_1_3_3_6_2","DOI":"10.1145\/2897189"},{"doi-asserted-by":"publisher","key":"e_1_3_3_7_2","DOI":"10.1007\/s00607-013-0331-3"},{"doi-asserted-by":"publisher","key":"e_1_3_3_8_2","DOI":"10.1145\/2063384.2063428"},{"issue":"1","key":"e_1_3_3_9_2","article-title":"Toward exascale resilience: 2014 update","volume":"1","author":"Cappello Franck","year":"2014","unstructured":"Franck Cappello, Al Geist, William Gropp, Sanjay Kale, Bill Kramer, and Marc Snir. 2014. Toward exascale resilience: 2014 update. Supercomput. Front. Innov. 1, 1 (2014).","journal-title":"Supercomput. Front. Innov."},{"unstructured":"F. Cappello K. Mohror et\u00a0al. 2019. VeloC: Very Low Overhead Checkpointing System. Retrieved from https:\/\/veloc.readthedocs.io\/en\/latest\/.","key":"e_1_3_3_10_2"},{"doi-asserted-by":"publisher","key":"e_1_3_3_11_2","DOI":"10.1145\/214451.214456"},{"doi-asserted-by":"publisher","key":"e_1_3_3_12_2","DOI":"10.1007\/BF00288685"},{"doi-asserted-by":"publisher","key":"e_1_3_3_13_2","DOI":"10.1016\/j.future.2004.11.016"},{"key":"e_1_3_3_14_2","first-page":"25:1\u201325:14","volume-title":"Proceedings of the Symposium on Theoretical Aspects of Computer Science (STACS\u201918)","author":"Demirci G\u00f6kalp","year":"2018","unstructured":"G\u00f6kalp Demirci, Henry Hoffmann, and David H. K. Kim. 2018. Approximation algorithms for scheduling with resource and precedence constraints. In Proceedings of the Symposium on Theoretical Aspects of Computer Science (STACS\u201918). 25:1\u201325:14."},{"doi-asserted-by":"publisher","key":"e_1_3_3_15_2","DOI":"10.1109\/DSN.2013.6575356"},{"unstructured":"Fault-Tolerance Research Hub. 2021. User Level Failure Mitigation. Retrieved from https:\/\/fault-tolerance.org.","key":"e_1_3_3_16_2"},{"doi-asserted-by":"publisher","key":"e_1_3_3_17_2","DOI":"10.1007\/BFb0022284"},{"doi-asserted-by":"publisher","key":"e_1_3_3_18_2","DOI":"10.1023\/A:1009794729459"},{"doi-asserted-by":"publisher","key":"e_1_3_3_19_2","DOI":"10.1145\/2063384.2063443"},{"doi-asserted-by":"publisher","key":"e_1_3_3_20_2","DOI":"10.1109\/WORKS51914.2020.00012"},{"key":"e_1_3_3_21_2","volume-title":"Computers and Intractability, a Guide to the Theory of NP-Completeness","author":"Garey M. R.","year":"1979","unstructured":"M. R. Garey and D. S. Johnson. 1979. Computers and Intractability, a Guide to the Theory of NP-Completeness. W. H. Freeman & Company."},{"doi-asserted-by":"publisher","key":"e_1_3_3_22_2","DOI":"10.1017\/S0963548397002939"},{"doi-asserted-by":"publisher","key":"e_1_3_3_23_2","DOI":"10.1137\/0117039"},{"doi-asserted-by":"publisher","key":"e_1_3_3_24_2","DOI":"10.1109\/TC.2018.2801300"},{"doi-asserted-by":"publisher","key":"e_1_3_3_25_2","DOI":"10.1145\/3225058.3225145"},{"doi-asserted-by":"publisher","key":"e_1_3_3_26_2","DOI":"10.5555\/2811302"},{"key":"e_1_3_3_27_2","first-page":"18","volume-title":"Proceedings of the 2nd International Workshop on Grid and Cooperative Computing (GCC\u201903)","author":"H\u00f6nig Udo","year":"2003","unstructured":"Udo H\u00f6nig and Wolfram Schiffmann. 2003. A parallel branch-and-bound algorithm for computing optimal task graph schedules. In Proceedings of the 2nd International Workshop on Grid and Cooperative Computing (GCC\u201903). 18\u201325."},{"doi-asserted-by":"publisher","key":"e_1_3_3_28_2","DOI":"10.1287\/opre.9.6.841"},{"doi-asserted-by":"publisher","key":"e_1_3_3_29_2","DOI":"10.1.0?topic=cluster-fault-tolerance"},{"doi-asserted-by":"publisher","key":"e_1_3_3_30_2","DOI":"10.1145\/1159892.1159899"},{"doi-asserted-by":"publisher","key":"e_1_3_3_31_2","DOI":"10.1145\/344588.344618"},{"doi-asserted-by":"publisher","key":"e_1_3_3_32_2","DOI":"10.1007\/3-540-44676-1_12"},{"doi-asserted-by":"publisher","key":"e_1_3_3_33_2","DOI":"10.1023\/A:1009817206440"},{"unstructured":"National Energy Research Scientific Computing Center (NERSC). Cori Log Traces. Retrieved from https:\/\/docs.nersc.gov\/systems\/cori\/.","key":"e_1_3_3_34_2"},{"issue":"1","key":"e_1_3_3_35_2","article-title":"Understanding failures in petascale computers","volume":"78","author":"Schroeder B.","year":"2007","unstructured":"B. Schroeder and G. A. Gibson. 2007. Understanding failures in petascale computers. J. Phys.: Conf. Ser. 78, 1 (2007).","journal-title":"J. Phys.: Conf. Ser."},{"doi-asserted-by":"publisher","key":"e_1_3_3_36_2","DOI":"10.1007\/s11227-010-0395-1"},{"unstructured":"Pegasus Team. 2014. Pegasus Workflow Generator. Retrieved from https:\/\/confluence.pegasus.isi.edu\/display\/pegasus\/WorkflowGenerator.","key":"e_1_3_3_37_2"},{"doi-asserted-by":"publisher","key":"e_1_3_3_38_2","DOI":"10.1137\/0213039"},{"doi-asserted-by":"publisher","key":"e_1_3_3_39_2","DOI":"10.1145\/361147.361115"}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3548607","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3548607","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:06Z","timestamp":1750182666000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3548607"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,16]]},"references-count":38,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2022,12,31]]}},"alternative-id":["10.1145\/3548607"],"URL":"https:\/\/doi.org\/10.1145\/3548607","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"type":"print","value":"2329-4949"},{"type":"electronic","value":"2329-4957"}],"subject":[],"published":{"date-parts":[[2022,12,16]]},"assertion":[{"value":"2021-11-18","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-07-11","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-12-16","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}