{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,3,25]],"date-time":"2024-03-25T09:12:10Z","timestamp":1711357930502},"reference-count":10,"publisher":"World Scientific Pub Co Pte Ltd","issue":"04","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["J. Inter. Net."],"published-print":{"date-parts":[[2009,12]]},"abstract":"<jats:p>An efficient and reliable fault tolerance protocol plays a key role in High Performance Computing. Rollback recovery is the most common fault tolerance technique used in High Performance Computing and especially in MPI applications. This technique relies on the reliability of the checkpoint storage. Most of the rollback recovery protocols assume that the checkpoint servers machines are reliable. However, in a grid environment any unit can fail at any moment, including components used to connect different administrative domains. Such failures lead to the loss of a whole set of machines, including the more reliable machines used to store the checkpoints in this administrative domain. Thus it is not safe to rely on the high Mean Time Between Failures of specific machines to store the checkpoint images.<\/jats:p><jats:p>This paper introduces a new coordinated checkpoint protocol, which tolerates checkpoint server failures and clusters failures, and ensures a checkpoint storage reliability in a grid environment. To provide this reliability the protocol is based on a replication process. We propose new hierarchical replication strategies that exploit the locality of checkpoint images in order to minimize inter-cluster communication.<\/jats:p><jats:p>We evaluate the effectiveness of our two hierarchical replication strategies through simulations against several criteria such as topology and scalability.<\/jats:p>","DOI":"10.1142\/s0219265909002613","type":"journal-article","created":{"date-parts":[[2010,3,23]],"date-time":"2010-03-23T11:16:52Z","timestamp":1269343012000},"page":"345-364","source":"Crossref","is-referenced-by-count":0,"title":["HIERARCHICAL REPLICATION TECHNIQUES TO ENSURE CHECKPOINT STORAGE RELIABILITY IN GRID ENVIRONMENT"],"prefix":"10.1142","volume":"10","author":[{"given":"FATIHA","family":"BOUABACHE","sequence":"first","affiliation":[{"name":"INRIA Saclay le de france\/Laboratoire de Recherche en Informatique, Universite Paris Sud-XI, 91405 ORSAY, France"}]},{"given":"THOMAS","family":"HERAULT","sequence":"additional","affiliation":[{"name":"INRIA Saclay le de france\/Laboratoire de Recherche en Informatique, Universite Paris Sud-XI, 91405 ORSAY, France"}]},{"given":"GILLES","family":"FEDAK","sequence":"additional","affiliation":[{"name":"INRIA Saclay le de france\/Laboratoire de Recherche en Informatique, Universite Paris Sud-XI, 91405 ORSAY, France"}]},{"given":"FRANCK","family":"CAPPELLO","sequence":"additional","affiliation":[{"name":"INRIA Saclay le de france\/Laboratoire de Recherche en Informatique, Universite Paris Sud-XI, 91405 ORSAY, France"}]}],"member":"219","published-online":{"date-parts":[[2012,4,30]]},"reference":[{"key":"rf2","volume":"34","author":"Elnozahy E. N.","journal-title":"CSURV: Computer Surveys"},{"key":"rf3","doi-asserted-by":"publisher","DOI":"10.1145\/214451.214456"},{"key":"rf5","author":"Fischer","journal-title":"Journal of the ACM"},{"key":"rf6","author":"chandra","journal-title":"Journal of the ACM"},{"key":"rf7","author":"Kesteloot L.","journal-title":"Journal of the ACM"},{"key":"rf9","doi-asserted-by":"publisher","DOI":"10.1177\/1094342006067469"},{"key":"rf17","doi-asserted-by":"crossref","unstructured":"L. V.\u00a0Kale and S.\u00a0Krishnan, Parallel programming using C++, eds. G. V.\u00a0Wilson and P.\u00a0Lu (MIT Press, 1996)\u00a0pp. 175\u2013213.","DOI":"10.7551\/mitpress\/5241.003.0009"},{"key":"rf20","doi-asserted-by":"crossref","unstructured":"L. V.\u00a0Kale and S.\u00a0Krishnan, Parallel programming using C++, eds. G. V.\u00a0Wilson and P.\u00a0Lu (MIT Press, 1996)\u00a0pp. 175\u2013213.","DOI":"10.7551\/mitpress\/5241.003.0009"},{"key":"rf22","doi-asserted-by":"publisher","DOI":"10.1109\/2.585156"},{"key":"rf26","volume":"1","author":"Engelmann C.","journal-title":"Journal of Computers"}],"container-title":["Journal of Interconnection Networks"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.worldscientific.com\/doi\/pdf\/10.1142\/S0219265909002613","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,3,25]],"date-time":"2024-03-25T08:29:30Z","timestamp":1711355370000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.worldscientific.com\/doi\/abs\/10.1142\/S0219265909002613"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,12]]},"references-count":10,"journal-issue":{"issue":"04","published-online":{"date-parts":[[2012,4,30]]},"published-print":{"date-parts":[[2009,12]]}},"alternative-id":["10.1142\/S0219265909002613"],"URL":"https:\/\/doi.org\/10.1142\/s0219265909002613","relation":{},"ISSN":["0219-2659","1793-6713"],"issn-type":[{"value":"0219-2659","type":"print"},{"value":"1793-6713","type":"electronic"}],"subject":[],"published":{"date-parts":[[2009,12]]}}}