{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,28]],"date-time":"2025-02-28T23:10:11Z","timestamp":1740784211506,"version":"3.38.0"},"reference-count":66,"publisher":"SAGE Publications","issue":"5","license":[{"start":{"date-parts":[[2018,12,25]],"date-time":"2018-12-25T00:00:00Z","timestamp":1545696000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2019,9]]},"abstract":"<jats:p> With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is reconstructed by an asynchronous online recovery. The computations in both the faulty and the healthy subdomains must be coordinated in a sensitive way, in particular, both under- and over-solving must be avoided. Both of these waste computational resources and will therefore increase the overall time-to-solution. To control the local recovery and guarantee an optimal recoupling, we introduce a stopping criterion based on a mathematical error estimator. It involves hierarchically weighted sums of residuals within the context of uniformly refined meshes and is well-suited in the context of parallel high-performance computing. The recoupling process is steered by local contributions of the error estimator before the fault. Failure scenarios when solving up to 6.9 \u00d7 10<jats:sup>11<\/jats:sup> unknowns on more than 245,766 parallel processes will be reported on a state-of-the-art peta-scale supercomputer demonstrating the robustness of the method. <\/jats:p>","DOI":"10.1177\/1094342018817088","type":"journal-article","created":{"date-parts":[[2018,12,26]],"date-time":"2018-12-26T03:42:39Z","timestamp":1545795759000},"page":"817-837","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":3,"title":["Adaptive control in roll-forward recovery for extreme scale multigrid"],"prefix":"10.1177","volume":"33","author":[{"given":"Markus","family":"Huber","sequence":"first","affiliation":[{"name":"Technische Universit\u00e4t M\u00fcnchen, M\u00fcnchen, Germany"}]},{"given":"Ulrich","family":"R\u00fcde","sequence":"additional","affiliation":[{"name":"Friedrich-Alexander Universit\u00e4t N\u00fcrnberg-Erlangen, Erlangen, Germany"},{"name":"CERFACS, Parallel Algorithms Project, Toulouse, France"}]},{"given":"Barbara","family":"Wohlmuth","sequence":"additional","affiliation":[{"name":"Technische Universit\u00e4t M\u00fcnchen, M\u00fcnchen, Germany"}]}],"member":"179","published-online":{"date-parts":[[2018,12,25]]},"reference":[{"volume-title":"Towards Resilient Parallel Linear Krylov Solvers: Recover-Restart Strategies. Technical Report RR-8324","year":"2013","author":"Agullo E","key":"bibr1-1094342018817088"},{"key":"bibr2-1094342018817088","doi-asserted-by":"crossref","unstructured":"Ainsworth M, Glusa C (2016a) Is the multigrid method fault tolerant? The multilevel case. ArXiv e-prints 1607.08502.","DOI":"10.2172\/1333645"},{"key":"bibr3-1094342018817088","doi-asserted-by":"crossref","unstructured":"Ainsworth M, Glusa C (2016b) Is the multigrid method fault tolerant? The two-grid case. ArXiv e-prints 1607.02497.","DOI":"10.2172\/1333645"},{"key":"bibr4-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1002\/9781118032824"},{"key":"bibr5-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1177\/1094342016684006"},{"key":"bibr6-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1109\/12.9736"},{"key":"bibr7-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-40528-5_6"},{"key":"bibr8-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-40528-5_10"},{"key":"bibr9-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1016\/j.apnum.2017.07.006"},{"key":"bibr10-1094342018817088","volume-title":"Fault Injection Techniques and Tools for Embedded Systems Reliability Evaluation","author":"Benso A","year":"2010","edition":"1"},{"volume-title":"Hierarchical Hybrid Grids: Data Structures and Core Algorithms for Efficient Finite Element Simulations on Supercomputers","year":"2006","author":"Bergen BK","key":"bibr11-1094342018817088"},{"key":"bibr12-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33518-1_24"},{"key":"bibr13-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1177\/1094342013488238"},{"key":"bibr14-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1137\/0613023"},{"key":"bibr15-1094342018817088","doi-asserted-by":"publisher","DOI":"10.15803\/ijnc.5.1_2"},{"key":"bibr16-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2008.12.002"},{"key":"bibr17-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1177\/1094342006067469"},{"key":"bibr18-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1090\/S0025-5718-1990-1023042-6"},{"key":"bibr19-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1090\/conm\/157\/01415"},{"key":"bibr20-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611970753"},{"key":"bibr21-1094342018817088","unstructured":"Bridges PG, Ferreira KB, Heroux MA, et al. (2012) Fault-tolerant linear solvers via selective reliability. ArXiv e-prints 1206.1390."},{"key":"bibr22-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1177\/1094342009106189"},{"key":"bibr23-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1177\/1094342009347767"},{"key":"bibr24-1094342018817088","first-page":"1","volume":"1","author":"Cappello F","year":"2014","journal-title":"Supercomputing Frontiers and Innovations"},{"key":"bibr25-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304590"},{"key":"bibr26-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1145\/2493123.2462920"},{"key":"bibr27-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1016\/0899-8248(89)90018-9"},{"key":"bibr28-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1177\/1094342010391989"},{"key":"bibr29-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-20943-2_1"},{"key":"bibr30-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1145\/2370036.2145845"},{"key":"bibr31-1094342018817088","doi-asserted-by":"crossref","unstructured":"Engwer C, Altenbernd M, Dreier NA, et al. (2018) A high-level C++ approach to manage local errors, asynchrony and faults in an MPI application. ArXiv e-prints 1804.04481.","DOI":"10.1109\/PDP2018.2018.00117"},{"key":"bibr32-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-58312-4_13"},{"key":"bibr33-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1002\/zamm.19980781527"},{"key":"bibr34-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2012.49"},{"key":"bibr35-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.78"},{"issue":"99","key":"bibr36-1094342018817088","first-page":"1","author":"Gamell M","year":"2017","journal-title":"IEEE Transactions on Parallel and Distributed Systems"},{"key":"bibr37-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1016\/j.camwa.2012.12.006"},{"key":"bibr38-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1016\/j.jocs.2016.06.006"},{"key":"bibr39-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.2968"},{"key":"bibr40-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2015.07.003"},{"key":"bibr41-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-662-02427-0"},{"key":"bibr42-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1145\/2465813.2465814"},{"key":"bibr43-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1109\/2.585157"},{"key":"bibr44-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1984.1676475"},{"key":"bibr45-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1137\/15M1026122"},{"key":"bibr46-1094342018817088","first-page":"165","volume-title":"Numerical Solution of Partial Differential Equations on Parallel Computers, number 51 in Lecture Notes in Computational Science and Engineering","author":"H\u00fclsemann F","year":"2005"},{"issue":"1","key":"bibr47-1094342018817088","first-page":"1","volume":"1","author":"J\u00fclich SC","year":"2015","journal-title":"Journal of Large-Scale Research Facilities"},{"key":"bibr48-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1177\/1094342018767736"},{"key":"bibr49-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1109\/MSPEC.2010.5605876"},{"key":"bibr50-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1137\/040620394"},{"key":"bibr51-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1016\/0743-7315(88)90027-5"},{"key":"bibr52-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1137\/15M1051786"},{"key":"bibr53-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2014.10.043"},{"key":"bibr54-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1109\/MSST.2007.4367962"},{"key":"bibr55-1094342018817088","doi-asserted-by":"crossref","unstructured":"Oswald P (1994) Multilevel finite element approximation. Teubner Scripts on Numerical Mathematics. B. G. Teubner, Stuttgart. Theory and applications. DOI: 10.1007\/978-3-322-91215-2.","DOI":"10.1007\/978-3-322-91215-2"},{"key":"bibr56-1094342018817088","unstructured":"Oswald P (2001) Subspace Correction Methods and Multigrid Theory. San Diego: Academic Press, pp. 553\u2013572."},{"key":"bibr57-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2503271"},{"key":"bibr58-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1137\/0730011"},{"key":"bibr59-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611970968"},{"key":"bibr60-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1145\/2807591.2807675"},{"key":"bibr61-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1137\/15M1014474"},{"key":"bibr62-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2013.145"},{"key":"bibr63-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1177\/1094342017720801"},{"key":"bibr64-1094342018817088","doi-asserted-by":"crossref","unstructured":"St\u00fcben K, Trottenberg U (1982) Multigrid Methods: Fundamental Algorithms, Model Problem analysis and Applications. Berlin: Springer, pp. 1\u2013176.","DOI":"10.1007\/BFb0069928"},{"key":"bibr65-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2012.91"},{"key":"bibr66-1094342018817088","doi-asserted-by":"publisher","DOI":"10.1109\/TCSII.2012.2231040"}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342018817088","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/1094342018817088","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342018817088","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,28]],"date-time":"2025-02-28T22:05:31Z","timestamp":1740780331000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342018817088"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,12,25]]},"references-count":66,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2019,9]]}},"alternative-id":["10.1177\/1094342018817088"],"URL":"https:\/\/doi.org\/10.1177\/1094342018817088","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"type":"print","value":"1094-3420"},{"type":"electronic","value":"1741-2846"}],"subject":[],"published":{"date-parts":[[2018,12,25]]}}}