{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T22:43:25Z","timestamp":1777675405524,"version":"3.51.4"},"reference-count":80,"publisher":"SAGE Publications","issue":"6","license":[{"start":{"date-parts":[[2016,9,8]],"date-time":"2016-09-08T00:00:00Z","timestamp":1473292800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2017,11]]},"abstract":"<jats:p>Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and when it is most productive, and customize for each application structure independently. This control is portable, and its embedding in application source makes it natural to express and easy to maintain. The ability to name multiple versions and \u201cpartially materialize\u201d them efficiently makes ambitious forward-recovery based on \u201cdata slices\u201d across versions or data structures both easy to express and efficient. Using several large applications (OpenMC, preconditioned conjugate gradient (PCG) solver, ddcMD, and Chombo), we evaluate the programming effort to add resilience. The required changes are small (&lt; 2% lines of code (LOC)), localized and machine-independent, and perhaps most important, require no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads &lt; 2% are generally achieved. This overhead suggests that GVR can be implemented in large-scale codes and support portable error recovery with modest investment and runtime impact. Our results are drawn from both IBM BG\/Q and Cray XC30 experiments, demonstrating portability. We also present two case studies of flexible error recovery, illustrating how GVR can be used for multi-version rollback recovery, and several different forward-recovery schemes. GVR\u2019s multi-version enables applications to survive latent errors (silent data corruption) with significant detection latency, and forward recovery can make that recovery extremely efficient. Our results suggest that GVR is scalable, portable, and efficient. GVR interfaces are flexible, supporting a variety of recovery schemes, and altogether GVR embodies a gentle-slope path to tolerate growing error rates in future extreme-scale systems.<\/jats:p>","DOI":"10.1177\/1094342016664796","type":"journal-article","created":{"date-parts":[[2016,9,8]],"date-time":"2016-09-08T20:35:12Z","timestamp":1473366912000},"page":"564-590","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":8,"title":["Exploring versioned distributed arrays for resilience in scientific applications"],"prefix":"10.1177","volume":"31","author":[{"given":"A","family":"Chien","sequence":"first","affiliation":[{"name":"University of Chicago, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"P","family":"Balaji","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"N","family":"Dun","sequence":"additional","affiliation":[{"name":"University of Chicago, USA"},{"name":"Argonne National Laboratory, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"A","family":"Fang","sequence":"additional","affiliation":[{"name":"University of Chicago, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"H","family":"Fujita","sequence":"additional","affiliation":[{"name":"University of Chicago, USA"},{"name":"Argonne National Laboratory, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"K","family":"Iskra","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Z","family":"Rubenstein","sequence":"additional","affiliation":[{"name":"University of Chicago, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Z","family":"Zheng","sequence":"additional","affiliation":[{"name":"HP Vertica, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"J","family":"Hammond","sequence":"additional","affiliation":[{"name":"Intel Corp, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"I","family":"Laguna","sequence":"additional","affiliation":[{"name":"Lawrence Livermore National Laboratory, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"D","family":"Richards","sequence":"additional","affiliation":[{"name":"Lawrence Livermore National Laboratory, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"A","family":"Dubey","sequence":"additional","affiliation":[{"name":"Lawrence Berkeley National Laboratory, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"B","family":"van Straalen","sequence":"additional","affiliation":[{"name":"Lawrence Berkeley National Laboratory, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"M","family":"Hoemmen","sequence":"additional","affiliation":[{"name":"Sandia National Laboratories, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"M","family":"Heroux","sequence":"additional","affiliation":[{"name":"Sandia National Laboratories, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"K","family":"Teranishi","sequence":"additional","affiliation":[{"name":"Sandia National Laboratories, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"A","family":"Siegel","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","published-online":{"date-parts":[[2016,9,8]]},"reference":[{"key":"bibr1-1094342016664796","volume-title":"Cray User Group Proceedings","author":"Antypas K","year":"2014"},{"key":"bibr2-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1109\/PRDC.2013.10"},{"key":"bibr3-1094342016664796","unstructured":"Bariuso R, Knies A (1994) Shmem user\u2019s guide. Cray Research, Inc."},{"key":"bibr4-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063427"},{"key":"bibr5-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1016\/0021-9991(89)90035-1"},{"key":"bibr6-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1016\/0021-9991(84)90073-1"},{"key":"bibr7-1094342016664796","unstructured":"Bergman K,  (2008) Exascale computing study: Technology challenges in achieving exascale systems. DARPA IPTO Technical Report."},{"key":"bibr8-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/1941487.1941507"},{"key":"bibr9-1094342016664796","first-page":"241","author":"Bridges PG","year":"2012","journal-title":"Resilience\u201911"},{"key":"bibr10-1094342016664796","author":"Bronevetsky G","year":"2008","journal-title":"ICS"},{"key":"bibr11-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1142\/S0129626411000126"},{"key":"bibr12-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1177\/1094342009347767"},{"key":"bibr13-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/2509136.2509546"},{"key":"bibr14-1094342016664796","unstructured":"Carlson W, Draper J, Culler D, Yelick K, Brooks E, Warren K (1999) Introduction to UPC and language specification. Technical Report CCS-TR-99-157, IDA Center for Computing Sciences."},{"key":"bibr15-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1177\/1094342007078442"},{"key":"bibr16-1094342016664796","first-page":"15","author":"Chang F","year":"2006","journal-title":"OSDI \u201906"},{"key":"bibr17-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/1094811.1094852"},{"key":"bibr18-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/2442516.2442533"},{"key":"bibr19-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1016\/0021-9991(90)90233-Q"},{"key":"bibr20-1094342016664796","unstructured":"Colella P, Graves D, Keen N, Ligocki T, Martin D, McCorquodale P, Modiano D, Schwartz P, Sternberg T, Van Straalen B (2009) Chombo software package for AMR applications design document. Technical report, LBNL, Applied Numerical Algorithms Group, Computational Research Division."},{"key":"bibr21-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2004.11.016"},{"key":"bibr22-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/2049662.2049663"},{"key":"bibr23-1094342016664796","unstructured":"DeHon A, Carter N, Quinn H (2011) Final report for CCC cross-layer reliability visioning study. Available at: http:\/\/www.cra.org\/ccc\/xlayer.php."},{"key":"bibr24-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.1974.1050511"},{"key":"bibr25-1094342016664796","first-page":"610","author":"Di Martino C","year":"2014","journal-title":"DSN 2014"},{"key":"bibr26-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/2465813.2465817"},{"key":"bibr27-1094342016664796","unstructured":"Dun N, Fujita H, Tramm J, Chien AA, Siegel AR (2014) Data decomposition in Monte Carlo particle transport simulations using global view arrays. Technical Report TR-2014-09, Department of Computer Science, University of Chicago."},{"key":"bibr28-1094342016664796","doi-asserted-by":"publisher","DOI":"10.2172\/1089338"},{"key":"bibr29-1094342016664796","unstructured":"Elnozahy M (2009) System resilience at extreme scale: A white paper. DARPA Resilience Report for ITO, William Harrod."},{"key":"bibr30-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000108"},{"key":"bibr31-1094342016664796","unstructured":"Fang A, Chien AA (2014) Applying GVR to molecular dynamics: Enabling resilience for scientific computations. Technical Report TR-2014-04, University of Chicago."},{"key":"bibr32-1094342016664796","doi-asserted-by":"crossref","unstructured":"Fang A, Chien AA (2015) How much SSD is useful for resilience in supercomputers. Manuscript submitted for publication.","DOI":"10.1145\/2751504.2751509"},{"key":"bibr33-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063443"},{"key":"bibr34-1094342016664796","first-page":"78","author":"Fiala D","year":"2012","journal-title":"Proceedings of Supercomputing"},{"key":"bibr35-1094342016664796","first-page":"1","author":"Glosli JN","year":"2007","journal-title":"Proceedings of SC \u201907"},{"key":"bibr36-1094342016664796","volume-title":"Matrix Computations","author":"Golub GH","year":"1996","edition":"3"},{"key":"bibr37-1094342016664796","doi-asserted-by":"publisher","DOI":"10.13182\/NT11-135"},{"key":"bibr38-1094342016664796","first-page":"237","author":"Gupta R","year":"2009","journal-title":"Proceedings of ICPP \u201909"},{"key":"bibr39-1094342016664796","unstructured":"GVR Team (2014a) Global View Resilience (GVR) API documentation, version 1.0. Technical report, University of Chicago, Department of Computer Science."},{"key":"bibr40-1094342016664796","unstructured":"GVR Team (2014b) Global View Resilience (gvr) documentation, release 1.0. Technical Report TR-2014-13, University of Chicago, Department of Computer Science."},{"key":"bibr41-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/46\/1\/067"},{"key":"bibr42-1094342016664796","author":"Hari SKS","year":"2012","journal-title":"IPDPS"},{"key":"bibr43-1094342016664796","author":"Heroux MA","year":"2014","journal-title":"CoRR"},{"key":"bibr44-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/1089014.1089021"},{"key":"bibr45-1094342016664796","doi-asserted-by":"publisher","DOI":"10.2172\/1113870"},{"key":"bibr46-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1109\/DSNW.2012.6264674"},{"key":"bibr47-1094342016664796","author":"Hoogenboom JE","year":"2011","journal-title":"ANS M&C"},{"key":"bibr48-1094342016664796","author":"Horelik N","year":"2014","journal-title":"PHYSOR 2014 \u2013 Advances in Reactor Physics \u2013 The Role of Reactor Physics toward a Sustainable Future"},{"key":"bibr49-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1984.1676475"},{"key":"bibr50-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1016\/0021-9991(78)90098-0"},{"key":"bibr51-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/1151374.1151375"},{"key":"bibr52-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.2887"},{"key":"bibr53-1094342016664796","author":"Lidman J","year":"2012","journal-title":"FTXS\u201912"},{"key":"bibr54-1094342016664796","author":"Lu Cd","year":"2004","journal-title":"Proceedings of Supercomputing"},{"key":"bibr55-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/2465813.2465821"},{"key":"bibr56-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/2465813.2465821"},{"key":"bibr57-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1147\/rd.62.0200"},{"key":"bibr58-1094342016664796","doi-asserted-by":"publisher","DOI":"10.5516\/NET.01.2012.502"},{"key":"bibr59-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/2660193.2660231"},{"key":"bibr60-1094342016664796","doi-asserted-by":"publisher","DOI":"10.2172\/984082"},{"key":"bibr61-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1177\/1094342006064503"},{"key":"bibr62-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/289918.289920"},{"key":"bibr63-1094342016664796","unstructured":"Kogge Peter,  (2008) Exascale computing study: Technology challenges in achieving exascale systems. DARPA IPTO Study Report. Available at: http:\/\/users.ece.gatech.edu\/mrichard\/ExascaleComputingStudyReports\/exascale_final_report_100208.pdf."},{"key":"bibr64-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/356725.356729"},{"key":"bibr65-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/2501620.2501623"},{"key":"bibr66-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1016\/j.anucene.2012.06.040"},{"key":"bibr67-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2013.06.011"},{"key":"bibr68-1094342016664796","unstructured":"Rubenstein Z, Fujita H, Zheng Z, Chien A (2013) Error checking and snapshot-based recovery in a preconditioned conjugate gradient solver. Technical Report TR-2013-11, Department of Computer Science, University of Chicago."},{"key":"bibr69-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1137\/0914028"},{"key":"bibr70-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/1993498.1993518"},{"key":"bibr71-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/357369.357371"},{"key":"bibr72-1094342016664796","author":"Schroeder B","year":"2006","journal-title":"DSN"},{"key":"bibr73-1094342016664796","author":"Shantharam M","year":"2011","journal-title":"Proceedings of Supercomputing"},{"key":"bibr74-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2012.06.012"},{"key":"bibr75-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1177\/1094342014522573"},{"key":"bibr76-1094342016664796","author":"Streitz FH","year":"2005","journal-title":"SC \u201905"},{"key":"bibr77-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/46\/1\/037"},{"key":"bibr78-1094342016664796","author":"Sutton TM","year":"2007","journal-title":"Joint International Topical Meeting on Mathematics and Computation and Supercomputing in Nuclear Applications"},{"key":"bibr79-1094342016664796","doi-asserted-by":"publisher","DOI":"10.1145\/361147.361115"},{"key":"bibr80-1094342016664796","author":"Zheng Z","year":"2011","journal-title":"Proceedings of IPDPS"}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342016664796","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/1094342016664796","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342016664796","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T08:15:28Z","timestamp":1777450528000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342016664796"}},"subtitle":["global view resilience"],"short-title":[],"issued":{"date-parts":[[2016,9,8]]},"references-count":80,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2017,11]]}},"alternative-id":["10.1177\/1094342016664796"],"URL":"https:\/\/doi.org\/10.1177\/1094342016664796","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,9,8]]}}}