{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,21]],"date-time":"2026-02-21T18:17:30Z","timestamp":1771697850388,"version":"3.50.1"},"reference-count":44,"publisher":"SAGE Publications","issue":"4","license":[{"start":{"date-parts":[[2009,9,17]],"date-time":"2009-09-17T00:00:00Z","timestamp":1253145600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2009,11]]},"abstract":"<jats:p> Over the past few years resilience has become a major issue for high-performance computing (HPC) systems, in particular in the perspective of large petascale systems and future exascale systems. These systems will typically gather from half a million to several millions of central processing unit (CPU) cores running up to a billion threads. From the current knowledge and observations of existing large systems, it is anticipated that exascale systems will experience various kinds of faults many times per day. It is also anticipated that the current approach for resilience, which relies on automatic or application level checkpoint\/restart, will not work because the time for checkpointing and restarting will exceed the mean time to failure of a full system. This set of projections leaves the community of fault tolerance for HPC systems with a difficult challenge: finding new approaches, which are possibly radically disruptive, to run applications until their normal termination, despite the essentially unstable nature of exascale systems. Yet, the community has only five to six years to solve the problem. 
This white paper synthesizes the motivations, observations and research issues considered as determinant by several complementary experts of HPC in applications, programming models, distributed systems and system management. <\/jats:p>","DOI":"10.1177\/1094342009347767","type":"journal-article","created":{"date-parts":[[2009,9,18]],"date-time":"2009-09-18T04:13:18Z","timestamp":1253247198000},"page":"374-388","source":"Crossref","is-referenced-by-count":229,"title":["Toward Exascale Resilience"],"prefix":"10.1177","volume":"23","author":[{"given":"Franck","family":"Cappello","sequence":"first","affiliation":[{"name":"INRIA, LABORATOIRE EN RECHERCHE INFORMATIQUE, FRANCE,"}]},{"given":"Al","family":"Geist","sequence":"additional","affiliation":[{"name":"OAK RIDGE NATIONAL LABORATORY, TN, USA"}]},{"given":"Bill","family":"Gropp","sequence":"additional","affiliation":[{"name":"DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN, USA"}]},{"given":"Laxmikant","family":"Kale","sequence":"additional","affiliation":[{"name":"DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN, USA"}]},{"given":"Bill","family":"Kramer","sequence":"additional","affiliation":[{"name":"NERSC, LAWRENCE BERKELEY NATIONAL LABORATORY, IL, USA"}]},{"given":"Marc","family":"Snir","sequence":"additional","affiliation":[{"name":"DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN, USA"}]}],"member":"179","published-online":{"date-parts":[[2009,9,17]]},"reference":[{"key":"atypb1","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2004.2"},{"key":"atypb2","volume-title":"Proceedings of the International Supercomputing Conference (ISC 2008)","author":"Bouteiller, A."},{"key":"atypb3","doi-asserted-by":"publisher","DOI":"10.1145\/200836.200880"},{"key":"atypb4","volume-title":"BLCR","year":"2009"},{"key":"atypb5","volume-title":"Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 
2003)","author":"Bronevetsky, G."},{"key":"atypb6","volume-title":"proceedings of the IEEE Parallel and Distributed Processing Symposium","author":"Chen, Z."},{"key":"atypb7","volume-title":"Proceedings of Supercomputing 2008","author":"Wang, C."},{"key":"atypb8","volume-title":"CIFT","year":"2009"},{"key":"atypb9","doi-asserted-by":"publisher","DOI":"10.1145\/214451.214456"},{"key":"atypb10","volume-title":"Proactive fault tolerance in large systems. HPCRI Workshop in conjunction with HPCA 2005","author":"Chakravorty, S.","year":"2005"},{"key":"atypb11","volume-title":"Proceedings of the 2004 ACM\/IEEE conference on Supercomputing","author":"Lu, C."},{"key":"atypb12","volume-title":"CSCL"},{"key":"atypb13","doi-asserted-by":"publisher","DOI":"10.1145\/361179.361202"},{"key":"atypb14","doi-asserted-by":"publisher","DOI":"10.1145\/568522.568525"},{"key":"atypb15","doi-asserted-by":"publisher","DOI":"10.1518\/001872095779049543"},{"key":"atypb16","volume-title":"Proceedings of the International Conference on Computational Science","author":"Engelman, C."},{"key":"atypb17","volume-title":"FT-MPI"},{"key":"atypb18","volume-title":"Proceedings of the 2007 ACM\/IEEE conference on Supercomputing","author":"Glosli, J.N."},{"key":"atypb19","volume-title":"Development of naturally fault tolerant algorithms for computing on 100,000 processors","author":"Geist, A.","year":"2002"},{"key":"atypb20","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1984.1676475"},{"key":"atypb21","doi-asserted-by":"publisher","DOI":"10.2172\/919272"},{"key":"atypb22","volume-title":"PERCU: A holistic method for evaluating high performance computing systems. Dissertation","author":"Kramer, W.","year":"2008"},{"key":"atypb23","volume-title":"LAM\/MPI","year":"2009"},{"key":"atypb24","doi-asserted-by":"publisher","DOI":"10.1201\/9781420040586"},{"key":"atypb25","volume-title":"Libckpt","year":"2009"},{"key":"atypb26","volume-title":"Scalable diskless checkpointing for large parallel systems. Ph.D. 
dissertation","author":"Lu, C.D.","year":"2005"},{"key":"atypb27","volume-title":"Proceedings of SuperComputing 2002","author":"Bosilca, G."},{"key":"atypb28","volume-title":"MVAPICH","year":"2009"},{"key":"atypb29","volume-title":"Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems","author":"Nagaraja, K."},{"key":"atypb30","volume-title":"OpenMPI","year":"2009"},{"key":"atypb31","volume-title":"Proceedings of the International Conference on Dependable Systems and Networks (DSN)","author":"Oliner, A."},{"key":"atypb32","volume-title":"PDSI","year":"2009"},{"key":"atypb33","doi-asserted-by":"publisher","DOI":"10.1109\/71.730527"},{"key":"atypb34","doi-asserted-by":"publisher","DOI":"10.1002\/(SICI)1097-024X(199902)29:2<125::AID-SPE224>3.0.CO;2-7"},{"key":"atypb35","volume-title":"Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP\u201905)","author":"Qin, F."},{"key":"atypb36","volume-title":"Proceedings of IEEE\/ACM Supercomputing 2002","author":"Sahoo, R.K."},{"key":"atypb37","volume-title":"Proceedings of IEEE DSN","author":"Liang, Y."},{"key":"atypb38","volume-title":"Proceedings of HIPC 2006, LNCS","author":"Chakravorty, S."},{"key":"atypb39","volume-title":"Poster in Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)","author":"Scott, S."},{"key":"atypb40","volume-title":"SCR","year":"2009"},{"key":"atypb41","doi-asserted-by":"crossref","unstructured":"Schroeder, B. and Gibson, G. (2007). Understanding failures in petascale computers. J Phys. Conf. 78: 012022. Teodorescu, R., Nakano, J. and Torrellas, J. (2006). SWICH: a prototype for efficient cache-level checkpointing and rollback. IEEE Micro. 26(5): 28-40.","DOI":"10.1109\/MM.2006.100"},{"key":"atypb42","doi-asserted-by":"crossref","unstructured":"Von Neuman, J. ( 1956). Probabilistic logics and the synthesis of reliable organisms from unreliable components. In Automata studies, edited by C. E. 
Shannon and J. McCarthy. New Jersey: Princeton University Press, pp. 43-98.","DOI":"10.1515\/9781400882618-003"},{"key":"atypb43","volume-title":"Proceedings of ICPP","author":"Yawei Li"},{"key":"atypb44","volume-title":"Proceedings of the 2004 IEEE International Conference on Cluster Computing","author":"Zheng, G."}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342009347767","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342009347767","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,28]],"date-time":"2025-01-28T11:22:27Z","timestamp":1738063347000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342009347767"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,9,17]]},"references-count":44,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2009,11]]}},"alternative-id":["10.1177\/1094342009347767"],"URL":"https:\/\/doi.org\/10.1177\/1094342009347767","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2009,9,17]]}}}