{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,24]],"date-time":"2025-08-24T01:53:24Z","timestamp":1756000404637,"version":"3.38.0"},"reference-count":33,"publisher":"SAGE Publications","issue":"2","license":[{"start":{"date-parts":[[2013,10,1]],"date-time":"2013-10-01T00:00:00Z","timestamp":1380585600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2014,5]]},"abstract":"<jats:p> High performance computing applications must be resilient to faults. The traditional fault-tolerance solution is checkpoint-recovery, by which application state is saved to and recovered from secondary storage throughout execution. It has been shown that, even when using an optimal checkpointing strategy, the checkpointing overhead precludes high parallel efficiency at large scale. Additional fault-tolerance mechanisms must thus be used. Such a mechanism is replication, that is, multiple processors performing the same computation so that a processor failure does not necessarily imply an application failure. In spite of resource waste, replication can lead to higher parallel efficiency when compared to using only checkpoint-recovery at large scale. <\/jats:p><jats:p> We propose to execute and checkpoint multiple application instances concurrently, an approach we term group replication. For exponential failures we give an upper bound on the expected application execution time. This bound corresponds to a particular checkpointing period that we derive. For general failures, we propose a dynamic programming algorithm to determine non-periodic checkpoint dates as well as an empirical periodic checkpointing solution whose period is found via a numerical search. Using simulation we evaluate our proposed approaches, including comparison to the non-replication case, for both exponential and Weibull failure distributions. Our broad finding is that group replication is useful in a range of realistic application and checkpointing overhead scenarios for future exascale platforms. <\/jats:p>","DOI":"10.1177\/1094342013505348","type":"journal-article","created":{"date-parts":[[2013,10,2]],"date-time":"2013-10-02T02:28:44Z","timestamp":1380680924000},"page":"210-224","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":10,"title":["Using group replication for resilience on exascale systems"],"prefix":"10.1177","volume":"28","author":[{"given":"Marin","family":"Bougeret","sequence":"first","affiliation":[{"name":"LIRMM Montpellier, France"}]},{"given":"Henri","family":"Casanova","sequence":"additional","affiliation":[{"name":"University of Hawaii at Manoa, Honolulu, USA"}]},{"given":"Yves","family":"Robert","sequence":"additional","affiliation":[{"name":"Ecole Normale Sup\u00e9rieure de Lyon, France"},{"name":"University of Tennessee, Knoxville, USA"}]},{"given":"Fr\u00e9d\u00e9ric","family":"Vivien","sequence":"additional","affiliation":[{"name":"Ecole Normale Sup\u00e9rieure de Lyon, France"},{"name":"INRIA, France"}]},{"given":"Dounia","family":"Zaidouni","sequence":"additional","affiliation":[{"name":"Ecole Normale Sup\u00e9rieure de Lyon, France"},{"name":"INRIA, France"}]}],"member":"179","published-online":{"date-parts":[[2013,10,1]]},"reference":[{"key":"bibr1-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1145\/1465482.1465560"},{"key":"bibr2-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1137\/1.9780898719642"},{"key":"bibr3-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063428"},{"key":"bibr4-1094342013505348","unstructured":"Bougeret M, Casanova H, Robert Y, (2012) Using group replication for resilience on exascale systems. Research report no. RR-7876, INRIA, ENS Lyon, France. Available at: http:\/\/hal.inria.fr\/hal-00668016. (accessed 16 September 2013)."},{"key":"bibr5-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-14390-8_22"},{"key":"bibr6-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1147\/rd.452.0311"},{"key":"bibr7-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2012.36"},{"key":"bibr8-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2004.11.016"},{"key":"bibr9-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1177\/1094342009347714"},{"key":"bibr10-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2004.15"},{"key":"bibr11-1094342013505348","first-page":"189","volume-title":"Proceedings of the 8th IASTED international conference on parallel and distributed computing and networks","author":"Engelmann C","year":"2009"},{"key":"bibr12-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063443"},{"key":"bibr13-1094342013505348","doi-asserted-by":"publisher","DOI":"10.2172\/1081941"},{"key":"bibr14-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1145\/311531.311532"},{"key":"bibr15-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1145\/511399.511362"},{"key":"bibr16-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063444"},{"key":"bibr17-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1145\/1851476.1851509"},{"key":"bibr18-1094342013505348","first-page":"381","volume-title":"Proceedings of the international symposium on fault-tolerant computing","author":"Kolettis N","year":"1995"},{"key":"bibr19-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1007\/s10723-007-9063-y"},{"key":"bibr20-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1109\/CCGRID.2010.71"},{"key":"bibr21-1094342013505348","first-page":"1","volume-title":"Proceedings of the international parallel and distributed processing symposium","author":"Liu Y","year":"2008"},{"key":"bibr22-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1109\/MSST.2007.4367962"},{"key":"bibr23-1094342013505348","volume-title":"Scheduling: Theory, Algorithms, and Systems","author":"Pinedo M","year":"2008","edition":"3"},{"key":"bibr24-1094342013505348","unstructured":"Sarkar V, (2009) ExaScale software study: Software challenges in extreme scale systems. White Paper available at: http:\/\/users.ece.gatech.edu\/mrichard\/ExascaleComputingStudyReports\/ECSS%20report%20101909.pdf. (accessed 16 September 2013)."},{"issue":"1","key":"bibr25-1094342013505348","volume":"78","author":"Schroeder B","year":"2007","journal-title":"Journal of Physics: Conference Series"},{"key":"bibr26-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2006.5"},{"issue":"8","key":"bibr28-1094342013505348","first-page":"2690","volume":"2","author":"Venkatesh K","year":"2010","journal-title":"Analysis"},{"key":"bibr29-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2005.67"},{"key":"bibr30-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2011.106"},{"key":"bibr31-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1109\/HPCS.2010.5547140"},{"key":"bibr32-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1145\/361147.361115"},{"key":"bibr33-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1109\/DSNW.2012.6264677"},{"key":"bibr34-1094342013505348","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTR.2009.5289177"}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342013505348","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/1094342013505348","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342013505348","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,2]],"date-time":"2025-03-02T23:25:54Z","timestamp":1740957954000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342013505348"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2013,10,1]]},"references-count":33,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2014,5]]}},"alternative-id":["10.1177\/1094342013505348"],"URL":"https:\/\/doi.org\/10.1177\/1094342013505348","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"type":"print","value":"1094-3420"},{"type":"electronic","value":"1741-2846"}],"subject":[],"published":{"date-parts":[[2013,10,1]]}}}