{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,23]],"date-time":"2026-01-23T17:31:38Z","timestamp":1769189498791,"version":"3.49.0"},"reference-count":52,"publisher":"SAGE Publications","issue":"1","license":[{"start":{"date-parts":[[2025,10,3]],"date-time":"2025-10-03T00:00:00Z","timestamp":1759449600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2026,1]]},"abstract":"<jats:p>This work investigates how to protect numerical iterative algorithms from all types of errors that can strike at scale: fail-stop errors (a.k.a. failures) and silent errors, striking both as computation errors and memory bit-flips. We combine various techniques: detectors for computation errors, checksums for memory errors, and checkpoint\/restart for failures. The objective is to minimize the expected time per iteration of the algorithm. We design a hierarchical pattern that combines and interleaves all these fault-tolerance mechanisms, and we determine the optimal periodic pattern that achieves this objective. We instantiate these results for the performance analysis of the Preconditioned Conjugate Gradient (PCG) algorithm: we report several scenarios where the optimal pattern dramatically decreases the overhead due to error mitigation.<\/jats:p>","DOI":"10.1177\/10943420251379675","type":"journal-article","created":{"date-parts":[[2025,10,3]],"date-time":"2025-10-03T09:27:06Z","timestamp":1759483626000},"page":"63-79","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":1,"title":["Fault-tolerant numerical iterative algorithms at scale"],"prefix":"10.1177","volume":"40","author":[{"given":"Alix","family":"Tremodeux","sequence":"first","affiliation":[{"name":"ENS Lyon"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2910-3540","authenticated-orcid":false,"given":"Anne","family":"Benoit","sequence":"additional","affiliation":[{"name":"ENS Lyon"},{"name":"Institut Universitaire de France"}]},{"given":"Emmanuel","family":"Agullo","sequence":"additional","affiliation":[{"name":"Inria Centre at the University of Bordeaux"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6756-6189","authenticated-orcid":false,"given":"Thomas","family":"Herault","sequence":"additional","affiliation":[{"name":"Inria Centre at the University of Bordeaux"},{"name":"University Tennessee Knoxville"}]},{"given":"Luc","family":"Giraud","sequence":"additional","affiliation":[{"name":"Inria Centre at the University of Bordeaux"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2361-055X","authenticated-orcid":false,"given":"Yves","family":"Robert","sequence":"additional","affiliation":[{"name":"ENS Lyon"}]}],"member":"179","published-online":{"date-parts":[[2025,10,3]]},"reference":[{"key":"e_1_3_4_2_1","doi-asserted-by":"publisher","DOI":"10.1137\/17M1153765"},{"key":"e_1_3_4_3_1","doi-asserted-by":"publisher","DOI":"10.1137\/18M122858X"},{"key":"e_1_3_4_4_1","doi-asserted-by":"publisher","DOI":"10.1177\/10943420211055188"},{"key":"e_1_3_4_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/FTXS.2018.00008"},{"key":"e_1_3_4_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3581784.3607089"},{"key":"e_1_3_4_7_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2016.07.007"},{"key":"e_1_3_4_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2016.54"},{"key":"e_1_3_4_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2016.54"},{"key":"e_1_3_4_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2016.39"},{"key":"e_1_3_4_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2016.2643660"},{"issue":"4","key":"e_1_3_4_12_1","first-page":"2015","article-title":"Silent error detection in numerical time-stepping schemes","volume":"29","author":"Benson AR","year":"2014","unstructured":"Benson AR, Schmit S, Schreiber R (2014) Silent error detection in numerical time-stepping schemes. The International Journal of High Performance Computing Applications 29(4): 2015.","journal-title":"The International Journal of High Performance Computing Applications"},{"key":"e_1_3_4_13_1","doi-asserted-by":"publisher","DOI":"10.1063\/1.4972269"},{"key":"e_1_3_4_14_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2008.12.002"},{"key":"e_1_3_4_15_1","volume-title":"ICS","author":"Bronevetsky G","year":"2008","unstructured":"Bronevetsky G, de Supinski B (2008) Soft error vulnerability of iterative linear algebra methods. In: ICS. ACM."},{"issue":"1","key":"e_1_3_4_16_1","first-page":"5","article-title":"Toward exascale resilience: 2014 update","volume":"1","author":"Cappello F","year":"2014","unstructured":"Cappello F, Geist A, Gropp W, et al. (2014) Toward exascale resilience: 2014 update. Supercomputing frontiers and innovations 1(1): 5\u201328.","journal-title":"Supercomputing frontiers and innovations"},{"key":"e_1_3_4_17_1","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1145\/2517327.2442533","article-title":"Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods","volume":"48","author":"Chen Z","year":"2013","unstructured":"Chen Z (2013a) Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods. ACM SIGPLAN Notices 48: 167\u2013176.","journal-title":"ACM SIGPLAN Notices"},{"key":"e_1_3_4_18_1","first-page":"167","article-title":"Online-abft: an online algorithm based fault tolerance scheme for soft error detection in iterative methods","volume":"48","author":"Chen Z","year":"2013","unstructured":"Chen Z (2013b) Online-abft: an online algorithm based fault tolerance scheme for soft error detection in iterative methods. Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming 48: 167\u2013176.","journal-title":"Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming"},{"key":"e_1_3_4_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.53"},{"key":"e_1_3_4_20_1","doi-asserted-by":"publisher","DOI":"10.1090\/S0025-5718-1968-0242392-2"},{"key":"e_1_3_4_21_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2004.11.016"},{"key":"e_1_3_4_22_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9780898718881"},{"key":"e_1_3_4_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2016.2517639"},{"key":"e_1_3_4_24_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jocs.2016.04.008"},{"key":"e_1_3_4_25_1","doi-asserted-by":"publisher","DOI":"10.56021\/9781421407944"},{"key":"e_1_3_4_26_1","doi-asserted-by":"publisher","DOI":"10.1137\/090771806"},{"key":"e_1_3_4_27_1","volume-title":"Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks","author":"Herault T","year":"2015","unstructured":"Herault T, Robert Y (eds) (2015) Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks. Springer Verlag."},{"key":"e_1_3_4_28_1","volume-title":"Fault-tolerant Iterative Methods via Selective Reliability. Research Report SAND2011-3915 C","author":"Heroux M","year":"2011","unstructured":"Heroux M, Hoemmen M (2011) Fault-tolerant Iterative Methods via Selective Reliability. Research Report SAND2011-3915 C. Sandia Nat. Lab."},{"key":"e_1_3_4_29_1","doi-asserted-by":"publisher","DOI":"10.6028\/jres.049.044"},{"key":"e_1_3_4_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1984.1676475"},{"key":"e_1_3_4_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2248487.2150989"},{"key":"e_1_3_4_32_1","volume-title":"M\u00e9thodes de d\u00e9composition de domaine. Application au calcul haute performance","author":"Jolivet P","year":"2006","unstructured":"Jolivet P (2006) M\u00e9thodes de d\u00e9composition de domaine. Application au calcul haute performance. PhD Thesis. Universit\u00e9 de Grenoble."},{"key":"e_1_3_4_33_1","doi-asserted-by":"publisher","DOI":"10.1137\/20M1376005"},{"key":"e_1_3_4_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476195"},{"key":"e_1_3_4_35_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-018-1806-7"},{"key":"e_1_3_4_36_1","doi-asserted-by":"publisher","DOI":"10.1147\/rd.62.0200"},{"key":"e_1_3_4_37_1","doi-asserted-by":"publisher","DOI":"10.1029\/96JC02776"},{"key":"e_1_3_4_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11075-022-01380-1"},{"key":"e_1_3_4_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/TDMR.2012.2192736"},{"key":"e_1_3_4_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-29082-4_4"},{"key":"e_1_3_4_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2010.18"},{"key":"e_1_3_4_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/16.278509"},{"key":"e_1_3_4_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/SBAC-PAD.2019.00040"},{"key":"e_1_3_4_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2530268.2530272"},{"key":"e_1_3_4_45_1","volume-title":"ICS","author":"Shantharam M","year":"2012","unstructured":"Shantharam M, Srinivasmurthy S, Raghavan P (2012) Fault tolerant preconditioned conjugate gradient for sparse linear system solution. In: ICS. ACM."},{"key":"e_1_3_4_46_1","volume-title":"HPC in Asia Poster, ISC","author":"Shoji F","year":"2015","unstructured":"Shoji F, Matsui S, Okamoto M, et al. (2015) Long term failure analysis of 10 peta-scale supercomputer. HPC in Asia Poster, ISC."},{"key":"e_1_3_4_47_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342014522573"},{"key":"e_1_3_4_48_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00211-013-0576-y"},{"key":"e_1_3_4_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/2694344.2694348"},{"key":"e_1_3_4_50_1","volume-title":"HPCG","author":"The TOP500 team","year":"2024","unstructured":"The TOP500 team (2024) HPCG. https:\/\/top500.org\/lists\/hpcg\/2024\/06\/"},{"key":"e_1_3_4_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/361147.361115"},{"key":"e_1_3_4_52_1","doi-asserted-by":"publisher","DOI":"10.1147\/rd.401.0003"},{"key":"e_1_3_4_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/4.658626"}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420251379675","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/10943420251379675","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420251379675","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,23]],"date-time":"2026-01-23T15:43:17Z","timestamp":1769182997000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/10943420251379675"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,3]]},"references-count":52,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1]]}},"alternative-id":["10.1177\/10943420251379675"],"URL":"https:\/\/doi.org\/10.1177\/10943420251379675","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,3]]}}}