{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T05:00:26Z","timestamp":1750309226164,"version":"3.41.0"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2024,5,21]],"date-time":"2024-05-21T00:00:00Z","timestamp":1716249600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Science Committee, Ministry of Education and Science, Republic of Kazakhstan","award":["AP09259208"],"award-info":[{"award-number":["AP09259208"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Model. Perform. Eval. Comput. Syst."],"published-print":{"date-parts":[[2024,9,30]]},"abstract":"<jats:p>A server subject to random breakdowns and repairs offers services to incoming jobs whose lengths are highly variable. A checkpointing policy is in operation, aiming to protect against possibly lengthy recovery periods by backing up the current state at periodic checkpoints. The problem of how to choose a checkpointing interval to optimise performance is addressed by analysing a general queueing model which includes breakdowns, repairs, back-ups and recoveries. Exact solutions are obtained under both Markovian and non-Markovian assumptions. Numerical experiments illustrate the conditions where checkpoints are useful and where they are not, and, in the former case, quantify the achievable benefits.<\/jats:p>","DOI":"10.1145\/3658667","type":"journal-article","created":{"date-parts":[[2024,4,13]],"date-time":"2024-04-13T07:32:42Z","timestamp":1712993562000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Checkpointing models for tasks of different types"],"prefix":"10.1145","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6190-5685","authenticated-orcid":false,"given":"Paul","family":"Ezhilchelvan","sequence":"first","affiliation":[{"name":"School of Computing, Newcastle University, Newcastle upon Tyne, United Kingdom of Great Britain and Northern Ireland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7797-7755","authenticated-orcid":false,"given":"Isi","family":"Mitrani","sequence":"additional","affiliation":[{"name":"School of Computing, Newcastle University, Newcastle upon Tyne, United Kingdom of Great Britain and Northern Ireland"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,5,21]]},"reference":[{"volume-title":"Proceedings of the ACM\/IEEE Conference on Supercomputing. 60","author":"Adiga N. R.","key":"e_1_3_2_2_2","unstructured":"N. R. Adiga, G. Almasi, G. S. Almasi, Y. Aridor, R. Barik, D. Beece, R. Bellofatto, G. Bhanot, R. Bickford, M. Blumrich, A. A. Bright, J. Brunheroto, C. Cascaval, J. Castanos, W. Chan, L. Ceze, P. Coteus, S. Chatterjee, D. Chen, G. Chiu, T. M. Cipolla, P. Crumley, K. M. Desai, A. Deutsch, T. Domany, M. B. Dombrowa, W. Donath, M. Eleftheriou, C. Erway, J. Esch, B. Fitch, J. Gagliano, A. Gara, R. Garg, R. Germain, M. E. Giampapa, B. Gopalsamy, J. Gunnels, M. Gupta, F. Gustavson, S. Hall, R. A. Haring, D. Heidel, P. Heidelberger, L. M. Herger, D. Hoenicke, R. D. Jackson, T. Jamal-Eddine, G. V. Kopcsay, E. Krevat, M. P. Kurhekar, A. P. Lanzetta, D. Lieber, L. K. Liu, M. Lu, M. Mendell, A. Misra, Y. Moatti, L. Mok, J. E. Moreira, B. J. Nathanson, M. Newton, M. Ohmacht, O. Oliner, V. Pandit, R. B. Pudota, R. Rand, R. Regan, B. Rubin, A. Ruehli, S. Rus, R. K Sahoo, A. Sanomiya, E. Schenfeld, M. Sharma, E. Shmueli, S. Singh, P. Song, V. Srinivasan, B. D. Steinmacher-Burow, K. Strauss, C. Surovic, R. Swetz, T. Takken, R. B. Tremaine, M. Tsao, A. R. Umamaheshwaran, P. Verma, P. Vranas, T. J. C. Ward, M. Wazlowski, W. Barrett, C. Engel, B. Drehmel, B. Hilgart, D. Hill, F. Kasemkhani, D. Krolak, C. T. Li, T. Liebsch, J. Marcella, A. Muff, A. Okomo, M. Rouse, A. Schram, M. Tubbs, G. Ulsh, C. Wait, J. Wittrup, M. Bae, K. Dockser, L. Kissel, M. K. Seager, J. S. Vetter, and K. Yates. 2002. An overview of the BlueGene\/L supercomputer. In Proceedings of the ACM\/IEEE Conference on Supercomputing. 60. DOI:10.1109\/SC.2002.10017"},{"key":"e_1_3_2_3_2","volume-title":"Analysis of a service facility with periodic checkpointing. Acta lnformatica 15","author":"Baccelli F.","year":"1981","unstructured":"F. Baccelli. 1981. Analysis of a service facility with periodic checkpointing. Acta lnformatica 15 (1981), 67\u201381."},{"key":"e_1_3_2_4_2","doi-asserted-by":"crossref","first-page":"881","DOI":"10.1007\/s002360050110","article-title":"Optimal fault-tolerant computing on multi-processor systems","volume":"34","author":"Bruno J. L.","year":"1997","unstructured":"J. L. Bruno and E. G. Coffman. 1997. Optimal fault-tolerant computing on multi-processor systems. Acta Informatica 34 (1997), 881\u2013904.","journal-title":"Acta Informatica"},{"key":"e_1_3_2_5_2","doi-asserted-by":"crossref","first-page":"1718","DOI":"10.14778\/3137765.3137777","article-title":"State management in Apache Flink\u00ae: Consistent stateful distributed stream processing","volume":"10","author":"Carbone P.","year":"2017","unstructured":"P. Carbone, S. Ewen, G. F\u00f3ra, S. Haridi, S. Richter, and K. Tzoumas. 2017. State management in Apache Flink\u00ae: Consistent stateful distributed stream processing. Proceedings of the VLDB Endowment 10, 12 (2017), 1718\u20131729.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_3_2_6_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3514496","article-title":"Runtime adaptation of data stream processing systems: The state of the art","volume":"54","author":"Cardellini V.","year":"2022","unstructured":"V. Cardellini, F. Lo Presti, M. Nardelli, and G. Russo. 2022. Runtime adaptation of data stream processing systems: The state of the art. ACM Computing Surveys, 54, 11 (2022), 1\u201336.","journal-title":"ACM Computing Surveys"},{"volume-title":"Proceedings of the 2019 Winter Simulation Conference. 2759\u20132770","author":"Carn\u00e1 S.","key":"e_1_3_2_7_2","unstructured":"S. Carn\u00e1, S. Ferracci, E. De Santis, A. Pellegrini, and F. Quaglia. 2019. Hardware-assisted incremental checkpointing in speculative parallel discrete event simulations. In Proceedings of the 2019 Winter Simulation Conference. 2759\u20132770."},{"key":"e_1_3_2_8_2","doi-asserted-by":"crossref","first-page":"40","DOI":"10.1109\/C-M.1975.218955","article-title":"A survey of analytic models of rollback and recovery strategies","volume":"8","author":"Chandy K. M.","year":"1975","unstructured":"K. M. Chandy. 1975. A survey of analytic models of rollback and recovery strategies. Computer 8, 5 (1975), 40\u201347.","journal-title":"Computer"},{"key":"e_1_3_2_9_2","doi-asserted-by":"crossref","first-page":"1167","DOI":"10.1109\/TSC.2018.2866421","article-title":"Uncertainty-aware online scheduling for real-time workflows in cloud service environment","volume":"14","author":"Chen H.","year":"2018","unstructured":"H. Chen, X. Zhu, G. Liu, and W. Pedrycz. 2018. Uncertainty-aware online scheduling for real-time workflows in cloud service environment. IEEE Transactions on Service Computing 14, 4 (2018), 1167\u20131178.","journal-title":"IEEE Transactions on Service Computing"},{"key":"e_1_3_2_10_2","volume-title":"Technical Report UCB\/EECS-2010-95. University of California","author":"Chen Y.","year":"2010","unstructured":"Y. Chen, A. S. Ganapathi, R. Griffith, and R. H. Katz. 2010. Analysis and Lessons from a Publicly Available Google Cluster Trace. Technical Report UCB\/EECS-2010-95. University of California, Berkeley."},{"key":"e_1_3_2_11_2","doi-asserted-by":"crossref","first-page":"38","DOI":"10.1109\/TNSE.2014.2380315","article-title":"A brief survey of PageRank algorithms","volume":"1","author":"Chung F.","year":"2014","unstructured":"F. Chung. 2014. A brief survey of PageRank algorithms. IEEE Transactions on Network Science and Engineering 1, 1 (2014), 38\u201342.","journal-title":"IEEE Transactions on Network Science and Engineering"},{"key":"e_1_3_2_12_2","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1109\/24.52636","article-title":"Optimal strategies for scheduling checkpoints and preventive maintenance","volume":"39","author":"Coffman E. G.","year":"1990","unstructured":"E. G. Coffman and E. N. Gilbert. 1990. Optimal strategies for scheduling checkpoints and preventive maintenance. IEEE Transactions on Reliability 39, 1 (1990), 9\u201318.","journal-title":"IEEE Transactions on Reliability"},{"volume-title":"The Single Server Queue. North-Holland","author":"Cohen J. W.","key":"e_1_3_2_13_2","unstructured":"J. W. Cohen. 1969. The Single Server Queue. North-Holland, Amsterdam."},{"key":"e_1_3_2_14_2","doi-asserted-by":"crossref","first-page":"313","DOI":"10.1017\/S0305004100030231","article-title":"A use of complex probabilities in the theory of stochastic processes","volume":"51","author":"Cox D. R.","year":"1955","unstructured":"D. R. Cox. 1955. A use of complex probabilities in the theory of stochastic processes. Mathematical Proceedings of the Cambridge Philosophical Society 51, 2 (1955), 313\u2013319.","journal-title":"Mathematical Proceedings of the Cambridge Philosophical Society"},{"key":"e_1_3_2_15_2","doi-asserted-by":"crossref","first-page":"1309","DOI":"10.1109\/12.61041","article-title":"Analyzing scheduled maintenance policies for repairable computer systems","volume":"39","author":"Silva E.","year":"1990","unstructured":"E. de Souza e Silva and H. R. Gail. 1990. Analyzing scheduled maintenance policies for repairable computer systems. IEEE Transactions on Computers 39, 11 (1990), 1309\u20131324.","journal-title":"IEEE Transactions on Computers"},{"key":"e_1_3_2_16_2","doi-asserted-by":"crossref","first-page":"156","DOI":"10.1016\/j.cie.2014.10.018","article-title":"A retrial queue for modeling fault-tolerant systems with checkpointing and rollback recovery","volume":"79","author":"Dimitriou I.","year":"2015","unstructured":"I. Dimitriou. 2015. A retrial queue for modeling fault-tolerant systems with checkpointing and rollback recovery. Computers & Industrial Engineering 79 (2015), 156\u2013167.","journal-title":"Computers & Industrial Engineering"},{"volume-title":"Proceedings of the 21st IEEE Symposium on Reliable Distributed Systems. 130\u2013139","author":"Dohi T.","key":"e_1_3_2_17_2","unstructured":"T. Dohi, N. Kaio, and K. S. Trivedi. 2002. Availability models with age-dependent checkpointing. In Proceedings of the 21st IEEE Symposium on Reliable Distributed Systems. 130\u2013139."},{"volume-title":"Proceedings of the IEEE International Advance Computing Conference. 1530\u20131537","author":"Duhan N.","key":"e_1_3_2_18_2","unstructured":"N. Duhan, A. K. Sharma, and K. K. Bhatia. 2009. Page ranking algorithms: A survey. In Proceedings of the IEEE International Advance Computing Conference. 1530\u20131537. DOI:10.1109\/IADCC.2009.4809246"},{"key":"e_1_3_2_19_2","doi-asserted-by":"crossref","first-page":"375","DOI":"10.1145\/568522.568525","article-title":"A survey of rollback-recovery protocols in message-passing systems","volume":"34","author":"Elnozahy E. N.","year":"2002","unstructured":"E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson. 2002. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys 34, 3 (2002), 375\u2013408.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_2_20_2","doi-asserted-by":"crossref","first-page":"838","DOI":"10.1016\/j.ejor.2007.05.010","article-title":"Queueing systems with different types of server interruptions","volume":"188","author":"Fiems D.","year":"2008","unstructured":"D. Fiems, T. Maertens, and H. Bruneel. 2008. Queueing systems with different types of server interruptions. European Journal of Operational Research 188 (2008), 838\u2013845.","journal-title":"European Journal of Operational Research"},{"volume-title":"n.d. Checkpointing. Retrieved","year":"2024","key":"e_1_3_2_21_2","unstructured":"Flink. n.d. Checkpointing. Retrieved April 20, 2024 from https:\/\/nightlies.apache.org\/flink\/flink-docs-release-1.17\/docs\/dev\/datastream\/fault-tolerance\/checkpointing\/"},{"key":"e_1_3_2_22_2","doi-asserted-by":"crossref","first-page":"1117","DOI":"10.1287\/opre.33.5.1117","article-title":"Stochastic decompositions in the M\/G\/1 queue with generalized vacations","volume":"33","author":"Fuhrmann S. W.","year":"1985","unstructured":"S. W. Fuhrmann and R. B. Cooper. 1985. Stochastic decompositions in the M\/G\/1 queue with generalized vacations. Operations Research 33, 5 (1985), 1117\u20131129.","journal-title":"Operations Research"},{"volume-title":"Proceedings of the 15th International Symposium on High-Assurance Systems Engineering. 113\u2013120","author":"Garraghan P.","key":"e_1_3_2_23_2","unstructured":"P. Garraghan, P. Townend, and J. Xu. 2014. An empirical failure-analysis of a large-scale cloud computing environment. In Proceedings of the 15th International Symposium on High-Assurance Systems Engineering. 113\u2013120."},{"key":"e_1_3_2_24_2","doi-asserted-by":"crossref","first-page":"259","DOI":"10.1145\/322123.322131","article-title":"On the optimum checkpoint interval","volume":"26","author":"Gelenbe E.","year":"1979","unstructured":"E. Gelenbe. 1979. On the optimum checkpoint interval. Journal of the ACM 26, 2 (1979), 259\u2013270.","journal-title":"Journal of the ACM"},{"volume-title":"Proceedings of the 28th IEEE Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS \u201920)","author":"Gelenbe E.","key":"e_1_3_2_25_2","unstructured":"E. Gelenbe, P. Boryszko, M. Siavvas, and J. Domanska. 2020. Optimum checkpoints for time and energy. In Proceedings of the 28th IEEE Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS \u201920). 1\u20138."},{"key":"e_1_3_2_26_2","doi-asserted-by":"crossref","first-page":"72","DOI":"10.1109\/32.120317","article-title":"On the optimal checkpointing of critical tasks and transaction-oriented systems","volume":"18","author":"Grassi V.","year":"1992","unstructured":"V. Grassi, L. Donatiello, and S. Tucci. 1992. On the optimal checkpointing of critical tasks and transaction-oriented systems. IEEE Transactions on Software Engineering 18, 1 (1992), 72\u201377.","journal-title":"IEEE Transactions on Software Engineering"},{"key":"e_1_3_2_27_2","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1002\/cpe.4707","article-title":"Efficient checkpointing mechanisms for primary-backup replication on the cloud","volume":"30","author":"G\u00fcler B.","year":"2018","unstructured":"B. G\u00fcler and \u00d6. \u00d6zkasap. 2018. Efficient checkpointing mechanisms for primary-backup replication on the cloud. Concurrency and Computation: Practice and Experience 30 (2018), 21.","journal-title":"Concurrency and Computation: Practice and Experience"},{"key":"e_1_3_2_28_2","doi-asserted-by":"crossref","first-page":"699","DOI":"10.1109\/12.936236","article-title":"A variational calculus approach to optimal checkpoint placement","volume":"50","author":"Ling Y.","year":"2001","unstructured":"Y. Ling, J. Mi, and X. Lin. 2001. A variational calculus approach to optimal checkpoint placement. IEEE Transactions on Computers 50, 7 (2001), 699\u2013708.","journal-title":"IEEE Transactions on Computers"},{"volume-title":"Proceedings of the IEEE Symposium on Parallel and Distributed Processing. 1\u20139.","author":"Liu Y.","key":"e_1_3_2_29_2","unstructured":"Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun, and S. L. Scott. 2008. An optimal checkpoint\/restart model for a large scale high performance computing system. In Proceedings of the IEEE Symposium on Parallel and Distributed Processing. 1\u20139."},{"key":"e_1_3_2_30_2","doi-asserted-by":"crossref","first-page":"1196","DOI":"10.1002\/cpe.1696","article-title":"A survey on software checkpointing and mobility techniques in distributed systems","volume":"23","author":"Marzouk S.","year":"2011","unstructured":"S. Marzouk and M. Jmaiel. 2011. A survey on software checkpointing and mobility techniques in distributed systems. Concurrency and Computation: Practice and Experience 23, 11 (2011), 1196\u20131212.","journal-title":"Concurrency and Computation: Practice and Experience"},{"key":"e_1_3_2_31_2","doi-asserted-by":"crossref","unstructured":"I. Mitrani. 1998. Probabilistic Modelling. Cambridge University Press.","DOI":"10.1017\/CBO9781139173087"},{"key":"e_1_3_2_32_2","unstructured":"V. F. Nicola. 1995. Checkpointing and the modelling of program execution time. In Software Fault Tolerance M. R. Lyu (Ed.). John Wiley & Sons 167\u2013188."},{"volume-title":"Proceedings of the 20th IEEE Symposium on Reliable Distributed Systems. 14\u201323","author":"Oliveira R.","key":"e_1_3_2_33_2","unstructured":"R. Oliveira, J. Pereira, and A. Schiper. 2001. Primary-backup replication: From a time-free protocol to a time-based implementation. In Proceedings of the 20th IEEE Symposium on Reliable Distributed Systems. 14\u201323."},{"key":"e_1_3_2_34_2","doi-asserted-by":"crossref","first-page":"130","DOI":"10.1109\/TDSC.2006.22","article-title":"Distribution-free checkpoint placement algorithms based on min-max principle","volume":"3","author":"Ozaki T.","year":"2006","unstructured":"T. Ozaki, T. Dohi, H. Okamura, and N. Kaio. 2006. Distribution-free checkpoint placement algorithms based on min-max principle. IEEE Transactions on Dependable and Secure Computing 3, 2 (2006), 130\u2013140.","journal-title":"IEEE Transactions on Dependable and Secure Computing"},{"key":"e_1_3_2_35_2","doi-asserted-by":"crossref","first-page":"1570","DOI":"10.1006\/jpdc.2001.1757","article-title":"Processor allocation and checkpoint interval selection in cluster computing systems","volume":"61","author":"Plank J. S.","year":"2001","unstructured":"J. S. Plank and M. G. Thomason. 2001. Processor allocation and checkpoint interval selection in cluster computing systems. Journal of Parallel and Distributed Computing 61, 11 (2001), 1570\u20131590.","journal-title":"Journal of Parallel and Distributed Computing"},{"key":"e_1_3_2_36_2","doi-asserted-by":"crossref","first-page":"1328","DOI":"10.1109\/TC.1987.5009472","article-title":"Optimal checkpointing of real-time tasks","volume":"11","author":"Shin K. G.","year":"1987","unstructured":"K. G. Shin, T.-H. Lin, and Y.-H. Lee. 1987. Optimal checkpointing of real-time tasks. IEEE Transactions on Computers C-36, 11 (1987), 1328\u20131341.","journal-title":"IEEE Transactions on Computers C-36"},{"volume-title":"Proceedings of the IEEE Conference on Cluster Computing (CLUSTER \u201917)","author":"Subasi O.","key":"e_1_3_2_37_2","unstructured":"O. Subasi, G. Kestor, and S. Krishnamoorthy. 2017. Toward a general theory of optimal checkpoint placement. In Proceedings of the IEEE Conference on Cluster Computing (CLUSTER \u201917). 464\u2013474."},{"key":"e_1_3_2_38_2","doi-asserted-by":"crossref","first-page":"361","DOI":"10.14778\/3489496.3489515","article-title":"Scabbard: Single-node fault-tolerant stream processing","volume":"15","author":"Theodorakis G.","year":"2021","unstructured":"G. Theodorakis, F. Kounelis, P. Pietzuch, and H. Pirk. 2021. Scabbard: Single-node fault-tolerant stream processing. Proceedings of the VLDB Endowment 15, 2 (2021), 361\u2013374.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_3_2_39_2","unstructured":"B. Tuthill K. Johnson and T. Schultz. 1999. IRIX Checkpoint and Restart Operation Guide. Silicon Graphics Inc."},{"volume-title":"Proceedings of the25thInternational Symposiumon Fault-Tolerant Computing:Digest of Papers. 22\u201331","author":"Wang Y.-M.","key":"e_1_3_2_40_2","unstructured":"Y.-M. Wang, Y. Huang, K.-Ph. Vo, P.-Y. Chung, and C. Kintala. 1995. Checkpointing and its applications. In Proceedings of the25thInternational Symposiumon Fault-Tolerant Computing:Digest of Papers. 22\u201331."},{"key":"e_1_3_2_41_2","doi-asserted-by":"crossref","unstructured":"W. Whitt. 1993. Approximations for the GI\/G\/m queue. Production and Operations Management 2 2 (40) 114\u2013161.","DOI":"10.1111\/j.1937-5956.1993.tb00094.x"}],"container-title":["ACM Transactions on Modeling and Performance Evaluation of Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3658667","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3658667","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:44:13Z","timestamp":1750290253000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3658667"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,21]]},"references-count":40,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,9,30]]}},"alternative-id":["10.1145\/3658667"],"URL":"https:\/\/doi.org\/10.1145\/3658667","relation":{},"ISSN":["2376-3639","2376-3647"],"issn-type":[{"type":"print","value":"2376-3639"},{"type":"electronic","value":"2376-3647"}],"subject":[],"published":{"date-parts":[[2024,5,21]]},"assertion":[{"value":"2022-12-06","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-04-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-05-21","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}