{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,17]],"date-time":"2026-01-17T11:09:50Z","timestamp":1768648190106,"version":"3.49.0"},"reference-count":24,"publisher":"SAGE Publications","issue":"3","license":[{"start":{"date-parts":[[2009,7,20]],"date-time":"2009-07-20T00:00:00Z","timestamp":1248048000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2009,8]]},"abstract":"<jats:p> The emergence of petascale systems and the promise of future exascale systems have reinvigorated the community interest in how to manage failures in such systems and ensure that large applications, lasting several hours or tens of hours, are completed successfully. Most of the existing results for several key mechanisms associated with fault tolerance in high-performance computing (HPC) platforms follow the rollback\u2014recovery approach. Over the last decade, these mechanisms have received a lot of attention from the community with different levels of success. Unfortunately, despite their high degree of optimization, existing approaches do not fit well with the challenging evolutions of large-scale systems. There is room and even a need for new approaches. Opportunities may come from different origins: diskless checkpointing, algorithmic-based fault tolerance, proactive operation, speculative execution, software transactional memory, forward recovery, etc. 
The contributions of this paper are as follows: (1) we summarize and analyze the existing results concerning the failures in large-scale computers and point out the urgent need for drastic improvements or disruptive approaches for fault tolerance in these systems; (2) we sketch most of the known opportunities and analyze their associated limitations; (3) we extract and express the challenges that the HPC community will have to face for addressing the stringent issue of failures in HPC systems. <\/jats:p>","DOI":"10.1177\/1094342009106189","type":"journal-article","created":{"date-parts":[[2009,7,20]],"date-time":"2009-07-20T11:28:49Z","timestamp":1248089329000},"page":"212-226","source":"Crossref","is-referenced-by-count":151,"title":["Fault Tolerance in Petascale\/Exascale Systems: Current Knowledge, Challenges and Research Opportunities"],"prefix":"10.1177","volume":"23","author":[{"given":"Franck","family":"Cappello","sequence":"first","affiliation":[{"name":"INRIA AND UIUC, THOMAS M. SIEBEL CENTER FOR COMPUTER SCIENCE, 201 N GOODWIN AVE, URBANA, IL 61801-2302, USA"}]}],"member":"179","published-online":{"date-parts":[[2009,7,20]]},"reference":[{"key":"atypb1","volume-title":"MPICH-V: toward a scalable fault tolerant MPI for volatile nodes","author":"Bosilca, G.","year":"2002"},{"key":"atypb2","volume-title":"Proceedings of CCGRID'08, Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)","author":"Bouabache, F."},{"key":"atypb3","volume-title":"ISC 2008, International Supercomputing Conference","author":"Bouteiller, A."},{"key":"atypb4","volume-title":"Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2003)","author":"Bronevetsky, G."},{"key":"atypb5","doi-asserted-by":"publisher","DOI":"10.1145\/214451.214456"},{"key":"atypb6","volume-title":"Proceedings of HIPC 2006 (Lecture Notes in Computer Science, Vol. 
4297)","author":"Chakravorty, S."},{"key":"atypb7","volume-title":"IEEE International Symposium on Parallel and Distributed Processing, 2008 (IPDPS 2008)","author":"Chen, Z."},{"key":"atypb8","doi-asserted-by":"publisher","DOI":"10.1145\/361179.361202"},{"key":"atypb9","doi-asserted-by":"publisher","DOI":"10.1145\/568522.568525"},{"key":"atypb10","volume-title":"Development of naturally fault tolerant algorithms for computing on 100,000 processors","author":"Geist, A.","year":"2002"},{"key":"atypb11","doi-asserted-by":"publisher","DOI":"10.1007\/s10723-006-9056-2"},{"key":"atypb12","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1984.1676475"},{"key":"atypb13","volume-title":"Proceedings of ICPP 2007","author":"Li, Y."},{"key":"atypb14","volume-title":"Proceedings of IEEE DSN 2006","author":"Liang, Y."},{"key":"atypb15","doi-asserted-by":"publisher","DOI":"10.1201\/9781420040586"},{"key":"atypb16","volume-title":"Scalable diskless checkpointing for large parallel systems, PhD dissertation","author":"Lu, C.D.","year":"2005"},{"key":"atypb17","volume-title":"Proceedings of the International Conference on Dependable Systems and Networks (DSN)","author":"Oliner, A."},{"key":"atypb18","doi-asserted-by":"publisher","DOI":"10.1109\/71.730527"},{"key":"atypb19","doi-asserted-by":"publisher","DOI":"10.1002\/(SICI)1097-024X(199902)29:2<125::AID-SPE224>3.0.CO;2-7"},{"key":"atypb20","volume-title":"Proceedings of IEEE\/ACM Supercomputing 2002","author":"Sahoo, R.K."},{"key":"atypb21","volume-title":"Poster at the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) 2009","author":"Scott, S."},{"key":"atypb22","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/78\/1\/012022"},{"key":"atypb23","volume-title":"Proceedings of the 10th International Parallel Processing Symposium","author":"Stellner, G."},{"key":"atypb24","volume-title":"Proceedings of Supercomputing 2008","author":"Wang, C."}],"container-title":["The International Journal of 
High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342009106189","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342009106189","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,1]],"date-time":"2025-03-01T20:12:06Z","timestamp":1740859926000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342009106189"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,7,20]]},"references-count":24,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2009,8]]}},"alternative-id":["10.1177\/1094342009106189"],"URL":"https:\/\/doi.org\/10.1177\/1094342009106189","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2009,7,20]]}}}