{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,30]],"date-time":"2025-10-30T07:06:27Z","timestamp":1761807987115,"version":"3.38.0"},"reference-count":52,"publisher":"SAGE Publications","issue":"3","license":[{"start":{"date-parts":[[2016,7,27]],"date-time":"2016-07-27T00:00:00Z","timestamp":1469577600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2016,8]]},"abstract":"<jats:p> Ultra-large\u2013scale simulations via solving partial differential equations (PDEs) require very large computational systems for their timely solution. Studies have shown that the rate of failure grows with the system size, and these trends are likely to worsen in future machines. Thus, as systems, and the problems solved on them, continue to grow, the ability to survive failures is becoming a critical aspect of algorithm development. The sparse grid combination technique (SGCT), which is a cost-effective method for solving higher dimensional PDEs, can be easily modified to provide algorithm-based fault tolerance. <\/jats:p><jats:p> In this article, we describe how the SGCT can produce fault-tolerant versions of the Gyrokinetic Electromagnetic Numerical Experiment plasma application, Taxila Lattice Boltzmann Method application, and Solid Fuel Ignition application. We use an alternate component grid combination formula by adding some redundancy on the SGCT to recover data from lost processes. User-level failure mitigation (ULFM) message passing interface (MPI) is used to recover the processes, and our implementation is robust over multiple failures and recovery (processes and nodes). 
<\/jats:p><jats:p> An acceptable degree of modification of the applications is required. Results using the 2-D SGCT show competitive execution times with acceptable error (within 0.1% to 1.0%), compared to the same simulation with a single full resolution grid. The benefits improve when the 3-D SGCT is used. Experiments show the applications\u2019 ability to successfully recover from multiple failures, and applying multiple SGCT reduces the computed solution error. Process recovery via ULFM MPI increases from approximately 1.5 sec at 64 cores to approximately 5\u2009sec at 2048 cores for a one-off failure. This compares applications\u2019 built-in checkpointing with job restart in conjunction with the classical SGCT on failure, which have overheads four times as large for a single failure, excluding the recomputation overhead. An analysis for a long-running application considering recomputation times indicates a reduction in overhead of over an order of magnitude. <\/jats:p>","DOI":"10.1177\/1094342015628056","type":"journal-article","created":{"date-parts":[[2016,2,12]],"date-time":"2016-02-12T02:14:06Z","timestamp":1455243246000},"page":"335-359","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":16,"title":["Complex scientific applications made fault-tolerant with the sparse grid combination technique"],"prefix":"10.1177","volume":"30","author":[{"given":"Md Mohsin","family":"Ali","sequence":"first","affiliation":[{"name":"Research School of Computer Science, The Australian National University, Canberra, Australia"}]},{"given":"Peter E","family":"Strazdins","sequence":"additional","affiliation":[{"name":"Research School of Computer Science, The Australian National University, Canberra, Australia"}]},{"given":"Brendan","family":"Harding","sequence":"additional","affiliation":[{"name":"Mathematical Sciences Institute, The Australian National University, Canberra, 
Australia"}]},{"given":"Markus","family":"Hegland","sequence":"additional","affiliation":[{"name":"Mathematical Sciences Institute, The Australian National University, Canberra, Australia"}]}],"member":"179","published-online":{"date-parts":[[2016,7,27]]},"reference":[{"key":"bibr1-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2009.370"},{"key":"bibr2-1094342015628056","first-page":"40","volume-title":"Proceedings of the Third International Conference on Performance, Safety and Robustness in Complex Systems and Applications (PESARO 2013)","author":"Ali MM","year":"2013"},{"key":"bibr3-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1109\/HPCSim.2015.7237082"},{"key":"bibr4-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2014.132"},{"key":"bibr5-1094342015628056","unstructured":"Balay S, Abhyankar S, Adams MF, PETSc Web page, 2014. Available at: http:\/\/www.mcs.anl.gov\/petsc"},{"key":"bibr6-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4612-4546-9"},{"key":"bibr7-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1109\/HPCSim.2012.6266992"},{"key":"bibr8-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-36949-0_57"},{"volume-title":"Toward Message Passing Failure Management","year":"2013","author":"Bland WB","key":"bibr9-1094342015628056"},{"key":"bibr10-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33518-1_24"},{"key":"bibr11-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1017\/S0962492904000182"},{"key":"bibr12-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1007\/s10596-013-9379-6"},{"volume-title":"Computational Differential Equations","year":"1996","author":"Eriksson K","key":"bibr13-1094342015628056"},{"key":"bibr14-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-45255-9_47"},{"key":"bibr15-1094342015628056","unstructured":"Fault Tolerance Working Group. 
Run-through stabilization interfaces and semantics. Available at: svn.mpi-forum.org\/trac\/mpi-forum-web\/wiki\/ft\/run_through_stabilization (accessed 14 December 2013)."},{"key":"bibr16-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1006\/jcph.2000.6627"},{"issue":"4","key":"bibr17-1094342015628056","first-page":"4","volume":"3","author":"Gibson G","year":"2007","journal-title":"Software Enabling Technologies for Petascale Science"},{"volume-title":"Multiscale Effects in Plasma Microturbulence","year":"2009","author":"G\u00f6rler T","key":"bibr18-1094342015628056"},{"key":"bibr19-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2011.05.034"},{"key":"bibr20-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1142\/S0129626492000180"},{"key":"bibr21-1094342015628056","first-page":"34","volume":"52","author":"Griebel M","year":"1996","journal-title":"Proceedings of Flow Simulation on High Performance Computers II, Notes on Numerical Fluid Mechanics"},{"key":"bibr22-1094342015628056","first-page":"263","volume-title":"Proceedings of Iterative Methods in Linear Algebra","author":"Griebel M","year":"1992"},{"key":"bibr23-1094342015628056","first-page":"584","volume-title":"Proceedings of the International Conference on Parallel Computing, (ParCo 2013)","author":"Harding B","year":"2013"},{"key":"bibr24-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1137\/140964448"},{"key":"bibr25-1094342015628056","first-page":"574","volume-title":"Proceedings of the International Conference on Parallel Computing (ParCo 2013)","author":"Heene M","year":"2013"},{"key":"bibr26-1094342015628056","first-page":"564","volume-title":"Proceedings of the International Conference on Parallel Computing (ParCo 2013)","author":"Hupp 
P","year":"2013"},{"key":"bibr27-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.308"},{"key":"bibr28-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2007.370605"},{"key":"bibr29-1094342015628056","unstructured":"Jenko F and the GENE development team (2014) The GENE code, October 2014. Available at: http:\/\/www.gene.rzg.mpg.de (accessed 13 October 2014)."},{"key":"bibr30-1094342015628056","unstructured":"IPM: Integrated performance monitoring. Available at: http:\/\/ipm-hpc.sourceforge.net\/ (accessed 25 January 2013)."},{"key":"bibr31-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-31703-3_10"},{"key":"bibr32-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2013.05.176"},{"key":"bibr33-1094342015628056","first-page":"878","volume-title":"Proceedings of Supercomputing","author":"Message Passing Interface Forum","year":"1993"},{"key":"bibr34-1094342015628056","unstructured":"NCI: National computational infrastructure. 
Available at: http:\/\/nci.org.au\/raijin\/ (accessed 27 January 2015)."},{"key":"bibr35-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-22997-3_3"},{"key":"bibr36-1094342015628056","first-page":"471","volume-title":"Proceedings of the International Conference on Parallel Computing, (ParCo 2013)","author":"Pauli S","year":"2013"},{"key":"bibr37-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-14313-2_48"},{"key":"bibr38-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1109\/71.730527"},{"key":"bibr39-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.86.036701"},{"key":"bibr40-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2014.126"},{"key":"bibr41-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2006.5"},{"key":"bibr42-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.47.1815"},{"key":"bibr43-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1177\/1094342006064482"},{"key":"bibr44-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1177\/1094342014522573"},{"key":"bibr45-1094342015628056","unstructured":"Solving the Bratu (SFI\u2014solid fuel ignition) problem in a 2D rectangular domain (2015a) Available at: http:\/\/www.mcs.anl.gov\/petsc\/petsc-2.2.0\/src\/snes\/examples\/tutorials\/ex5f90.F.html (accessed 16 January 2015)."},{"key":"bibr46-1094342015628056","unstructured":"Solving the Bratu (SFI\u2014solid fuel ignition) problem in a 3D rectangular domain (2015b) Available at: http:\/\/www.mcs.anl.gov\/petsc\/petsc-2.2.0\/src\/snes\/examples\/tutorials\/ex14.c.html (accessed 16 January 2015)."},{"key":"bibr47-1094342015628056","doi-asserted-by":"crossref","unstructured":"Strazdins PE, Ali MM, Harding B (2016) Design and analysis of two highly scalable sparse grid combination algorithms. Journal of Computational Science. Special Issue on Recent Advances in Parallel Techniques for Scientific Computing. 
(Submitted for Review). Available at: http:\/\/hdl.handle.net\/1885\/95531","DOI":"10.1016\/j.jocs.2016.06.004"},{"key":"bibr48-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2015.76"},{"key":"bibr49-1094342015628056","unstructured":"Taxila LBM website (2015) Available at: https:\/\/software.lanl.gov\/taxila\/ (accessed 12 February 2015)."},{"key":"bibr50-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1145\/2642769.2642774"},{"volume-title":"Proceedings of the 10th LCI International Conference on High-Performance Clustered Computing","year":"2009","author":"Wright NJ","key":"bibr51-1094342015628056"},{"key":"bibr52-1094342015628056","doi-asserted-by":"publisher","DOI":"10.1145\/361147.361115"}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342015628056","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/1094342015628056","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342015628056","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,1]],"date-time":"2025-03-01T23:39:05Z","timestamp":1740872345000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342015628056"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,7,27]]},"references-count":52,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2016,8]]}},"alternative-id":["10.1177\/1094342015628056"],"URL":"https:\/\/doi.org\/10.1177\/1094342015628056","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"type":"print","value":"1094-3420"},{"type":"electronic","value":
"1741-2846"}],"subject":[],"published":{"date-parts":[[2016,7,27]]}}}