{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,27]],"date-time":"2025-11-27T13:45:49Z","timestamp":1764251149209,"version":"3.38.0"},"reference-count":38,"publisher":"SAGE Publications","issue":"3","license":[{"start":{"date-parts":[[2006,8,1]],"date-time":"2006-08-01T00:00:00Z","timestamp":1154390400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2006,8]]},"abstract":"<jats:p> High performance computing platforms such as Clusters, Grid and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most used message passing libraries in HPC applications. These two trends raise the need for fault-tolerant MPI. The MPICH-V project focuses on designing, implementing and comparing several automatic fault-tolerant protocols for MPI applications. We present an extensive related work section highlighting the originality of our approach and the proposed protocols. We then present four fault-tolerant protocols implemented in a new generic framework for fault-tolerant protocol comparison, covering a large spectrum of known approaches from coordinated checkpoint, to uncoordinated checkpoint associated with causal message logging. We measure the performance of these protocols on a micro-benchmark and compare them with the NAS benchmark, using an original fault tolerance test. Finally, we outline the lessons learned from this in depth fault-tolerant protocol comparison of MPI applications. <\/jats:p>","DOI":"10.1177\/1094342006067469","type":"journal-article","created":{"date-parts":[[2006,8,7]],"date-time":"2006-08-07T11:32:42Z","timestamp":1154950362000},"page":"319-333","source":"Crossref","is-referenced-by-count":88,"title":["MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI"],"prefix":"10.1177","volume":"20","author":[{"given":"A.","family":"Bouteiller","sequence":"first","affiliation":[{"name":"INRIA\/LRI, Universit\u00e9 Paris-Sud, Orsay, France"}]},{"given":"T.","family":"Herault","sequence":"additional","affiliation":[]},{"given":"G.","family":"Krawezik","sequence":"additional","affiliation":[]},{"given":"P.","family":"Lemarinier","sequence":"additional","affiliation":[]},{"given":"F.","family":"Cappello","sequence":"additional","affiliation":[{"name":"INRIA\/LRI, Universit\u00e9 Paris-Sud, Orsay, France"}]}],"member":"179","published-online":{"date-parts":[[2006,8,1]]},"reference":[{"key":"atypb1","doi-asserted-by":"publisher","DOI":"10.1109\/HPDC.1999.805295"},{"key":"atypb2","doi-asserted-by":"publisher","DOI":"10.1109\/FTCS.1999.781058"},{"key":"atypb3","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCS.1995.500024"},{"volume-title":"The NAS Parallel Benchmarks 2.0. Report NAS-95-020","year":"1995","author":"Bailey, D.","key":"atypb4"},{"volume-title":"Proceedings of the 1st International Symposium of Cluster Computing and the Grid (CCGRID2001)","author":"Batchu, R.","key":"atypb5"},{"key":"atypb6","first-page":"348","volume-title":"17th Symposium on Reliable Distributed Systems (SRDS'98)","author":"Bhatia, K.","year":"1998"},{"volume-title":"High Performance Networking and Computing (SC2002)","year":"2002","author":"Bosilca, G.","key":"atypb7"},{"key":"atypb8","doi-asserted-by":"publisher","DOI":"10.1145\/1048935.1050176"},{"volume-title":"IEEE\/ACM High Performance Networking and Computing (SC 2003)","year":"2003","author":"Bouteiller, A.","key":"atypb9"},{"key":"atypb10","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTR.2003.1253321"},{"key":"atypb11","first-page":"379","author":"Burns, G.","year":"1994","journal-title":"Proceedings of Supercomputing Symposium"},{"key":"atypb12","doi-asserted-by":"publisher","DOI":"10.1145\/214451.214456"},{"key":"atypb13","doi-asserted-by":"publisher","DOI":"10.1145\/509593.509626"},{"key":"atypb14","doi-asserted-by":"publisher","DOI":"10.1109\/FTCS.1992.243619"},{"key":"atypb15","doi-asserted-by":"publisher","DOI":"10.1109\/12.142678"},{"key":"atypb16","doi-asserted-by":"publisher","DOI":"10.1145\/568522.568525"},{"key":"atypb17","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-45255-9_47"},{"key":"atypb18","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-8191(01)00100-4"},{"key":"atypb19","doi-asserted-by":"publisher","DOI":"10.1177\/1094342004046045"},{"key":"atypb20","doi-asserted-by":"publisher","DOI":"10.1016\/0167-8191(96)00024-5"},{"volume-title":"The 17th Annual International Symposium on Fault-tolerant Computing (FTCS'87)","year":"1987","author":"Johnson, D. B.","key":"atypb21"},{"key":"atypb22","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCS.1991.148709"},{"key":"atypb23","first-page":"19","volume-title":"17th Symposium on Reliable Distributed Systems (SRDS 1998)","author":"Lee, B.","year":"1998"},{"key":"atypb24","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTR.2004.1392609"},{"volume-title":"Checkpoint and migration of UNIX processes in the condor distributed processing system","year":"1997","author":"Litzkow, M.","key":"atypb25"},{"issue":"4","key":"atypb26","volume":"10","author":"Louca, S.","year":"2000","journal-title":"Parallel Processing Letters(PPL)"},{"key":"atypb27","doi-asserted-by":"publisher","DOI":"10.1006\/jpdc.2001.1757"},{"key":"atypb28","doi-asserted-by":"publisher","DOI":"10.1109\/FTCS.1998.689454"},{"key":"atypb29","unstructured":"Pruitt, P. N. 1998. An Asynchronous Checkpoint and Rollback Facility for\n                    Distributed Computations. PhD thesis, College of William and Mary in Virginia."},{"key":"atypb30","doi-asserted-by":"publisher","DOI":"10.1109\/RELDIS.1998.740469"},{"key":"atypb31","doi-asserted-by":"publisher","DOI":"10.1109\/FTCS.1999.781033"},{"volume-title":"Proceedings, LACSI Symposium","author":"Sankaran, S.","key":"atypb32"},{"volume-title":"IASTED International Conference on Intelligent Information Management and Systems","author":"Snell, Q.","key":"atypb33"},{"volume-title":"MPI: The Complete Reference","year":"1996","author":"Snir, M.","key":"atypb34"},{"key":"atypb35","doi-asserted-by":"publisher","DOI":"10.1109\/IPPS.1996.508106"},{"key":"atypb36","doi-asserted-by":"publisher","DOI":"10.1145\/3959.3962"},{"key":"atypb37","doi-asserted-by":"publisher","DOI":"10.1109\/FTCS.1988.5295"},{"key":"atypb38","doi-asserted-by":"publisher","DOI":"10.1109\/HPDC.1993.263838"}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342006067469","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342006067469","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,4]],"date-time":"2025-03-04T12:36:22Z","timestamp":1741091782000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342006067469"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2006,8]]},"references-count":38,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2006,8]]}},"alternative-id":["10.1177\/1094342006067469"],"URL":"https:\/\/doi.org\/10.1177\/1094342006067469","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"type":"print","value":"1094-3420"},{"type":"electronic","value":"1741-2846"}],"subject":[],"published":{"date-parts":[[2006,8]]}}}