{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,16]],"date-time":"2025-11-16T01:56:34Z","timestamp":1763258194460,"version":"3.37.3"},"reference-count":28,"publisher":"Springer Science and Business Media LLC","issue":"2",
"license":[{"start":{"date-parts":[[2021,6,25]],"date-time":"2021-06-25T00:00:00Z","timestamp":1624579200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,6,25]],"date-time":"2021-06-25T00:00:00Z","timestamp":1624579200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],
"funder":[{"DOI":"10.13039\/501100006690","name":"Politecnico di Milano","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100006690","id-type":"DOI","asserted-by":"crossref"}]}],
"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Supercomput"],"published-print":{"date-parts":[[2022,2]]},
"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Due to the increasing size of HPC machines, dealing with faults is becoming mandatory due to their high frequency. Natively, MPI cannot handle faults and it stops the execution prematurely when it finds one. With the introduction of ULFM, it is possible to continue the execution, but it requires complex integration with the application. In this paper we propose Legio, a framework that introduces fault resiliency in embarrassingly parallel MPI applications. Legio exposes its features to the application transparently, removing any integration difficulty. After a fault, the execution continues only with the non-failed processes. We also propose a hierarchical alternative, which features lower repair costs on large communicators. We evaluated our solutions on the Marconi100 cluster at CINECA with benchmarks and real-world applications, showing that the overhead introduced by the library is negligible and it does not limit the scalability properties of MPI.<\/jats:p>",
"DOI":"10.1007\/s11227-021-03951-w","type":"journal-article","created":{"date-parts":[[2021,6,25]],"date-time":"2021-06-25T11:02:34Z","timestamp":1624618954000},"page":"2175-2195","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Legio: fault resiliency for embarrassingly parallel MPI applications"],"prefix":"10.1007","volume":"78",
"author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0223-2900","authenticated-orcid":false,"given":"Roberto","family":"Rocco","sequence":"first","affiliation":[]},{"given":"Davide","family":"Gadioli","sequence":"additional","affiliation":[]},{"given":"Gianluca","family":"Palermo","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,6,25]]},
"reference":[{"issue":"4","key":"3951_CR1","doi-asserted-by":"publisher","first-page":"309","DOI":"10.1177\/1094342009347714","volume":"23","author":"J Dongarra","year":"2009","unstructured":"Dongarra J, Beckman P, Aerts P, Cappello F, Lippert T, Matsuoka S, Messina P, Moore T, Stevens R, Trefethen A et al (2009) The international exascale software project: a call to cooperative action by the global high-performance community. Int J High Perform Comput Appl 23(4):309\u2013322","journal-title":"Int J High Perform Comput Appl"},
{"key":"3951_CR2","unstructured":"Amarasinghe S, Campbell D, Carlson W, Chien A, Dally W, Elnohazy E, Hall M, Harrison R, Harrod W, Hill K et al (2009) Exascale software study: software challenges in extreme scale systems. DARPA IPTO, Air Force Research Labs, Tech. Rep 1\u2013153"},
{"key":"3951_CR3","doi-asserted-by":"crossref","unstructured":"Zheng G, Ni X, Kal\u00e9 LV (2012) A scalable double in-memory checkpoint and restart scheme towards exascale. In: IEEE\/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012). IEEE, 2012, pp 1\u20136","DOI":"10.1109\/DSNW.2012.6264677"},
{"issue":"1","key":"3951_CR4","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1177\/1094342010391989","volume":"25","author":"J Dongarra","year":"2011","unstructured":"Dongarra J, Beckman P, Moore T, Aerts P, Aloisio G, Andre J-C, Barkai D, Berthou J-Y, Boku T, Braunschweig B et al (2011) The international exascale software project roadmap. Int J High Perform Comput Appl 25(1):3\u201360","journal-title":"Int J High Perform Comput Appl"},
{"key":"3951_CR5","doi-asserted-by":"crossref","unstructured":"Clarke L, Glendinning I, Hempel R (1994) The mpi message passing interface standard. In: Programming environments for massively parallel distributed systems. Springer, pp 213\u2013218","DOI":"10.1007\/978-3-0348-8534-8_21"},
{"issue":"2","key":"3951_CR6","doi-asserted-by":"publisher","first-page":"129","DOI":"10.1177\/1094342014522573","volume":"28","author":"M Snir","year":"2014","unstructured":"Snir M, Wisniewski RW, Abraham JA, Adve SV, Bagchi S, Balaji P, Belak J, Bose P, Cappello F, Carlson B et al (2014) Addressing failures in exascale computing. Int J High Perform Comput Appl 28(2):129\u2013173","journal-title":"Int J High Perform Comput Appl"},
{"issue":"3","key":"3951_CR7","doi-asserted-by":"publisher","first-page":"319","DOI":"10.1177\/1094342006067469","volume":"20","author":"A Bouteiller","year":"2006","unstructured":"Bouteiller A, Herault T, Krawezik G, Lemarinier P, Cappello F (2006) Mpich-v project: a multiprotocol automatic fault-tolerant mpi. Int J High Perform Comput Appl 20(3):319\u2013333","journal-title":"Int J High Perform Comput Appl"},
{"key":"3951_CR8","unstructured":"Ferreira K, Riesen R, Oldfield R, Stearley J, Laros J, Pedretti K, Brightwell R (2011) rmpi: increasing fault resiliency in a message-passing environment. Sandia National Laboratories, Albuquerque, NM, Tech. Rep. SAND2011-2488"},
{"key":"3951_CR9","doi-asserted-by":"crossref","unstructured":"Fagg GE, Dongarra JJ (2000) Ft-mpi: Fault tolerant mpi, supporting dynamic applications in a dynamic world. In: European Parallel Virtual Machine\/Message Passing Interface Users\u2019 Group Meeting. Springer, pp 346\u2013353","DOI":"10.1007\/3-540-45255-9_47"},
{"issue":"3","key":"3951_CR10","doi-asserted-by":"publisher","first-page":"244","DOI":"10.1177\/1094342013488238","volume":"27","author":"W Bland","year":"2013","unstructured":"Bland W, Bouteiller A, Herault T, Bosilca G, Dongarra J (2013) Post-failure recovery of mpi communication capability: design and rationale. Int J High Perform Comput Appl 27(3):244\u2013254","journal-title":"Int J High Perform Comput Appl"},
{"key":"3951_CR11","doi-asserted-by":"crossref","unstructured":"Gamell M, Katz DS, Kolla H, Chen J, Klasky S, Parashar M (2014) Exploring automatic, online failure recovery for scientific applications at extreme scales. In: SC\u201914: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, pp 895\u2013906","DOI":"10.1109\/SC.2014.78"},
{"issue":"1","key":"3951_CR12","doi-asserted-by":"publisher","first-page":"100","DOI":"10.1007\/s11227-016-1629-7","volume":"73","author":"N Losada","year":"2017","unstructured":"Losada N, Cores I, Mart\u00edn MJ, Gonz\u00e1lez P (2017) Resilient mpi applications using an application-level checkpointing framework and ulfm. J Supercomput 73(1):100\u2013113","journal-title":"J Supercomput"},
{"key":"3951_CR13","doi-asserted-by":"crossref","unstructured":"Teranishi K, Heroux MA (2014) Toward local failure local recovery resilience model using mpi-ulfm. In: Proceedings of the 21st european mpi users\u2019 group meeting, pp 51\u201356","DOI":"10.1145\/2642769.2642774"},
{"key":"3951_CR14","doi-asserted-by":"publisher","first-page":"24","DOI":"10.1016\/j.jpdc.2015.07.005","volume":"84","author":"S Pauli","year":"2015","unstructured":"Pauli S, Arbenz P, Schwab C (2015) Intrinsic fault tolerance of multilevel monte carlo methods. J Parallel Distrib Comput 84:24\u201336","journal-title":"J Parallel Distrib Comput"},
{"key":"3951_CR15","unstructured":"\u201cExscalate4cov - exascale smart platform against pathogens.\u201d [Online]. Available: http:\/\/www.exscalate4cov.eu\/"},
{"key":"3951_CR16","unstructured":"\u201cMarconi100, the new accelerated system.\u201d [Online]. Available: https:\/\/www.hpc.cineca.it\/hardware\/marconi100"},
{"issue":"3","key":"3951_CR17","doi-asserted-by":"publisher","first-page":"501","DOI":"10.1109\/TPDS.2018.2866794","volume":"30","author":"F Shahzad","year":"2018","unstructured":"Shahzad F, Thies J, Kreutzer M, Zeiser T, Hager G, Wellein G (2018) Craft: A library for easier application-level checkpoint\/restart and automatic fault tolerance. IEEE Trans Parallel Distrib Syst 30(3):501\u2013514","journal-title":"IEEE Trans Parallel Distrib Syst"},
{"key":"3951_CR18","doi-asserted-by":"crossref","unstructured":"Gamell M, Teranishi K, Heroux MA, Mayo J, Kolla H, Chen J, Parashar M (2015) Local recovery and failure masking for stencil-based applications at extreme scales. In: SC\u201915: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, pp 1\u201312","DOI":"10.1145\/2807591.2807672"},
{"key":"3951_CR19","doi-asserted-by":"publisher","first-page":"450","DOI":"10.1016\/j.future.2018.09.041","volume":"91","author":"N Losada","year":"2019","unstructured":"Losada N, Bosilca G, Bouteiller A, Gonz\u00e1lez P, Mart\u00edn MJ (2019) Local rollback for resilient mpi applications with application-level checkpointing and message logging. Future Gener Comput Syst 91:450\u2013464","journal-title":"Future Gener Comput Syst"},
{"issue":"8","key":"3951_CR20","doi-asserted-by":"publisher","first-page":"225","DOI":"10.1145\/2370036.2145845","volume":"47","author":"P Du","year":"2012","unstructured":"Du P, Bouteiller A, Bosilca G, Herault T, Dongarra J (2012) Algorithm-based fault tolerance for dense matrix factorizations. Acm sigplan notices 47(8):225\u2013234","journal-title":"Acm sigplan notices"},
{"key":"3951_CR21","doi-asserted-by":"crossref","unstructured":"Kalim U, Gardner MK, Feng W A non-invasive approach for realizing resilience in mpi. In: Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, pp 1\u20138","DOI":"10.1145\/3086157.3086166"},
{"key":"3951_CR22","doi-asserted-by":"crossref","unstructured":"Strazdins PE, Ali MM, Debusschere B (2016) Application fault tolerance for shrinking resources via the sparse grid combination technique. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, pp 1232\u20131238","DOI":"10.1109\/IPDPSW.2016.210"},
{"key":"3951_CR23","doi-asserted-by":"crossref","unstructured":"Rizzi F, Morris K, Sargsyan K, Mycek P, Safta C, Debusschere B, LeMaitre O, Knio O (2016) Ulfm-mpi implementation of a resilient task-based partial differential equations preconditioner. In: Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, pp 19\u201326","DOI":"10.1145\/2909428.2909429"},
{"key":"3951_CR24","doi-asserted-by":"publisher","first-page":"467","DOI":"10.1016\/j.future.2020.01.026","volume":"106","author":"N Losada","year":"2020","unstructured":"Losada N, Gonz\u00e1lez P, Mart\u00edn MJ, Bosilca G, Bouteiller A, Teranishi K (2020) Fault tolerance of mpi applications in exascale systems: The ulfm solution. Future Gener Comput Syst 106:467\u2013481","journal-title":"Future Gener Comput Syst"},
{"key":"3951_CR25","doi-asserted-by":"crossref","unstructured":"Thakur R, Gropp W (2007) Open issues in mpi implementation. In: Asia-Pacific Conference on Advances in Computer Systems Architecture. Springer, pp 327\u2013338","DOI":"10.1007\/978-3-540-74309-5_31"},
{"key":"3951_CR26","unstructured":"\u201cmpibench: Mpi benchmark to test and measure collective performance.\u201d [Online]. Available: https:\/\/github.com\/LLNL\/mpiBench"},
{"key":"3951_CR27","unstructured":"Bailey D, Harris T, Saphir W, Van Der\u00a0Wijngaart R, Woo A, Yarrow M (1995) The nas parallel benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, Tech. Rep"},
{"key":"3951_CR28","doi-asserted-by":"crossref","unstructured":"Garg R, Price G, Cooperman G (2019) Mana for mpi: Mpi-agnostic network-agnostic transparent checkpointing. In: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, pp 49\u201360","DOI":"10.1145\/3307681.3325962"}],
"container-title":["The Journal of Supercomputing"],"original-title":[],"language":"en",
"link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-021-03951-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11227-021-03951-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-021-03951-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],
"deposited":{"date-parts":[[2022,1,24]],"date-time":"2022-01-24T11:23:19Z","timestamp":1643023399000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11227-021-03951-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,6,25]]},"references-count":28,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,2]]}},"alternative-id":["3951"],"URL":"https:\/\/doi.org\/10.1007\/s11227-021-03951-w","relation":{},"ISSN":["0920-8542","1573-0484"],"issn-type":[{"type":"print","value":"0920-8542"},{"type":"electronic","value":"1573-0484"}],"subject":[],"published":{"date-parts":[[2021,6,25]]},
"assertion":[{"value":"12 June 2021","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 June 2021","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"No funding was received for conducting this study.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Funding"}},{"value":"The authors have no conflicts of interest to declare that are relevant to the content of this article. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflicts of interest"}}]}}