{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T23:03:19Z","timestamp":1777676599253,"version":"3.51.4"},"reference-count":27,"publisher":"SAGE Publications","issue":"3","license":[{"start":{"date-parts":[[2016,7,27]],"date-time":"2016-07-27T00:00:00Z","timestamp":1469577600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2016,8]]},"abstract":"<jats:p>The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM; yet questions related to its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experiences on using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application to shed light on the advantages and difficulties of this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master\u2013worker applications, it provides few benefits for more common bulk synchronous MPI applications. 
To address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs with better applicability and support than ULFM for application-level recovery mechanisms, such as global rollback.<\/jats:p>","DOI":"10.1177\/1094342015623623","type":"journal-article","created":{"date-parts":[[2016,1,13]],"date-time":"2016-01-13T00:41:02Z","timestamp":1452645662000},"page":"305-319","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":35,"title":["Evaluating and extending user-level fault tolerance in MPI applications"],"prefix":"10.1177","volume":"30","author":[{"given":"Ignacio","family":"Laguna","sequence":"first","affiliation":[{"name":"Lawrence Livermore National Laboratory, Livermore, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"David F","family":"Richards","sequence":"additional","affiliation":[{"name":"Lawrence Livermore National Laboratory, Livermore, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Todd","family":"Gamblin","sequence":"additional","affiliation":[{"name":"Lawrence Livermore National Laboratory, Livermore, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Martin","family":"Schulz","sequence":"additional","affiliation":[{"name":"Lawrence Livermore National Laboratory, Livermore, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Bronis R","family":"de Supinski","sequence":"additional","affiliation":[{"name":"Lawrence Livermore National Laboratory, Livermore, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kathryn","family":"Mohror","sequence":"additional","affiliation":[{"name":"Lawrence Livermore National Laboratory, Livermore, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Howard","family":"Pritchard","sequence":"additional","affiliation":[{"name":"Los Alamos National Laboratory, NM, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","published-online":{"date-parts":[[2016,7,27]]},"reference":[{"key":"bibr1-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1109\/HPDC.1999.805295"},{"key":"bibr2-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-32820-6_48"},{"key":"bibr3-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1007\/s00607-013-0331-3"},{"key":"bibr4-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2012.113"},{"key":"bibr5-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1177\/1094342009347767"},{"key":"bibr6-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1177\/1094342010391989"},{"key":"bibr7-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-45255-9_47"},{"key":"bibr8-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1145\/2464996.2465020"},{"key":"bibr9-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.78"},{"key":"bibr10-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1145\/1362622.1362700"},{"key":"bibr11-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1177\/1094342004046045"},{"key":"bibr12-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/46\/1\/067"},{"key":"bibr13-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.56.6811"},{"key":"bibr14-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2007.370605"},{"key":"bibr15-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-24449-0_40"},{"key":"bibr16-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1145\/165854.165874"},{"key":"bibr17-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1145\/2642769.2642775"},{"key":"bibr18-1094342015623623","author":"Message Passing Interface Forum","year":"2012","journal-title":"MPI: A Message-Passing Interface Standard, Version 
3.0"},{"key":"bibr19-1094342015623623","doi-asserted-by":"publisher","DOI":"10.2172\/984082"},{"key":"bibr20-1094342015623623","volume-title":"Software Fault Tolerance Techniques and Implementation","author":"Pullum LL","year":"2001"},{"key":"bibr21-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1145\/1654059.1654121"},{"key":"bibr22-1094342015623623","volume-title":"A Survey of Checkpoint\/Restart Implementations","author":"Roman E","year":"2002"},{"key":"bibr23-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1177\/1094342005056139"},{"key":"bibr24-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2014.126"},{"key":"bibr25-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-19328-6_1"},{"key":"bibr26-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/46\/1\/037"},{"key":"bibr27-1094342015623623","doi-asserted-by":"publisher","DOI":"10.1145\/2642769.2642774"}],"container-title":["The International Journal of High Performance Computing 
Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342015623623","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/1094342015623623","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342015623623","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T08:19:36Z","timestamp":1777450776000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342015623623"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,7,27]]},"references-count":27,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2016,8]]}},"alternative-id":["10.1177\/1094342015623623"],"URL":"https:\/\/doi.org\/10.1177\/1094342015623623","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,7,27]]}}}