{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T22:54:44Z","timestamp":1775084084927,"version":"3.50.1"},"reference-count":168,"publisher":"SAGE Publications","issue":"2","license":[{"start":{"date-parts":[[2014,3,21]],"date-time":"2014-03-21T00:00:00Z","timestamp":1395360000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2014,5]]},"abstract":"<jats:p> We present here a report produced by a workshop on \u2018Addressing failures in exascale computing\u2019 held in Park City, Utah, 4\u201311 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach. <\/jats:p><jats:p> The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions. 
<\/jats:p>","DOI":"10.1177\/1094342014522573","type":"journal-article","created":{"date-parts":[[2014,3,22]],"date-time":"2014-03-22T04:16:45Z","timestamp":1395461805000},"page":"129-173","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":267,"title":["Addressing failures in exascale computing"],"prefix":"10.1177","volume":"28","author":[{"given":"Marc","family":"Snir","sequence":"first","affiliation":[{"name":"Argonne National Laboratory, IL, USA"}]},{"given":"Robert W","family":"Wisniewski","sequence":"additional","affiliation":[{"name":"Intel Corporation, CA, USA"}]},{"given":"Jacob A","family":"Abraham","sequence":"additional","affiliation":[{"name":"University of Texas at Austin, TX, USA"}]},{"given":"Sarita V","family":"Adve","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, IL, USA"}]},{"given":"Saurabh","family":"Bagchi","sequence":"additional","affiliation":[{"name":"Purdue University, IN, USA"}]},{"given":"Pavan","family":"Balaji","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory, IL, USA"}]},{"given":"Jim","family":"Belak","sequence":"additional","affiliation":[{"name":"Lawrence Livermore National Laboratory, CA, USA"}]},{"given":"Pradip","family":"Bose","sequence":"additional","affiliation":[{"name":"IBM T.J. Watson Research Center, NY, USA"}]},{"given":"Franck","family":"Cappello","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory, IL, USA"}]},{"given":"Bill","family":"Carlson","sequence":"additional","affiliation":[{"name":"IDA Center for Computing Sciences, MD, USA"}]},{"given":"Andrew A","family":"Chien","sequence":"additional","affiliation":[{"name":"The University of Chicago, IL, USA"}]},{"given":"Paul","family":"Coteus","sequence":"additional","affiliation":[{"name":"IBM T.J. 
Watson Research Center, NY, USA"}]},{"given":"Nathan A","family":"DeBardeleben","sequence":"additional","affiliation":[{"name":"Los Alamos National Laboratory, NM, USA"}]},{"given":"Pedro C","family":"Diniz","sequence":"additional","affiliation":[{"name":"USC Information Sciences Institute, CA, USA"}]},{"given":"Christian","family":"Engelmann","sequence":"additional","affiliation":[{"name":"Oak Ridge National Laboratory, TN, USA"}]},{"given":"Mattan","family":"Erez","sequence":"additional","affiliation":[{"name":"University of Texas at Austin, TX, USA"}]},{"given":"Saverio","family":"Fazzari","sequence":"additional","affiliation":[{"name":"Booz Allen Hamilton, VA, USA"}]},{"given":"Al","family":"Geist","sequence":"additional","affiliation":[{"name":"Oak Ridge National Laboratory, TN, USA"}]},{"given":"Rinku","family":"Gupta","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory, IL, USA"}]},{"given":"Fred","family":"Johnson","sequence":"additional","affiliation":[{"name":"SAIC, VA, USA"}]},{"given":"Sriram","family":"Krishnamoorthy","sequence":"additional","affiliation":[{"name":"Pacific Northwest National Laboratory, WA, USA"}]},{"given":"Sven","family":"Leyffer","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory, IL, USA"}]},{"given":"Dean","family":"Liberty","sequence":"additional","affiliation":[{"name":"Advanced Micro Devices, MA, USA"}]},{"given":"Subhasish","family":"Mitra","sequence":"additional","affiliation":[{"name":"Stanford University, CA, USA"}]},{"given":"Todd","family":"Munson","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory, IL, USA"}]},{"given":"Rob","family":"Schreiber","sequence":"additional","affiliation":[{"name":"Hewlett Packard, CA, USA"}]},{"given":"Jon","family":"Stearley","sequence":"additional","affiliation":[{"name":"Sandia National Laboratory, NM, USA"}]},{"given":"Eric Van","family":"Hensbergen","sequence":"additional","affiliation":[{"name":"ARM Inc., TX, 
USA"}]}],"member":"179","published-online":{"date-parts":[[2014,3,21]]},"reference":[{"key":"bibr1-1094342014522573","first-page":"529","author":"Agostinelli M","year":"2005","journal-title":"Proceedings of the 2005 IEEE international reliability physics symposium (IRPS)"},{"key":"bibr2-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1145\/1654059.1654104"},{"key":"bibr3-1094342014522573","first-page":"196","author":"Austin TM","year":"1999","journal-title":"Proceedings of the annual international symposium on microarchitecture (MICRO)"},{"key":"bibr4-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1973.5009108"},{"key":"bibr5-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2004.2"},{"key":"bibr6-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2006.58"},{"key":"bibr7-1094342014522573","unstructured":"Bailey FR, Bell G, Blondin J, (2007) Petascale metrics panel report. Available at: http:\/\/research.microsoft.com\/en-us\/um\/people\/gbell\/supers\/ascac_petascale_metrics_panel_report_and_executive_summary_2007-02-12.pdf (accessed 25 February 2014)"},{"key":"bibr8-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1002\/bltj.21543"},{"key":"bibr9-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1986.1676762"},{"key":"bibr10-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/12.57055"},{"key":"bibr11-1094342014522573","author":"Bautista-Gomez LA","year":"2011","journal-title":"International conference for high-performance computing, networking, storage and analysis (SC)"},{"key":"bibr12-1094342014522573","author":"Bautista-Gomez L","year":"2011","journal-title":"International conference for high-performance computing, networking, storage and analysis (SC)"},{"key":"bibr13-1094342014522573","volume-title":"Introduction to Stochastic Programming","author":"Birge 
J","year":"1997"},{"key":"bibr14-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33518-1_24"},{"key":"bibr15-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2005.110"},{"key":"bibr16-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2008.12.002"},{"key":"bibr17-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-23397-5_6"},{"key":"bibr18-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1177\/1094342006067469"},{"key":"bibr19-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1145\/1250727.1250728"},{"key":"bibr20-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2010.5544927"},{"key":"bibr21-1094342014522573","unstructured":"Cai K, Qin Z, Memory Device with Soft-Decision Decoding. US Patent 20130107611 A1, May 2, 2013."},{"key":"bibr22-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1177\/1094342009347767"},{"key":"bibr23-1094342014522573","first-page":"1","author":"Cappello F","year":"2010","journal-title":"Proceedings of the 19th international conference on computer communications and networks (ICCCN)"},{"key":"bibr24-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/TEST.2005.1584030"},{"key":"bibr25-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1147\/rd.452.0311"},{"key":"bibr26-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/VTEST.1998.670858"},{"key":"bibr27-1094342014522573","author":"Chen D","year":"2011","journal-title":"International conference for high-performance computing, networking, storage and analysis (SC)"},{"key":"bibr28-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2006.1639333"},{"key":"bibr29-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1201\/9781420010305"},{"key":"bibr30-1094342014522573","author":"Chung J","year":"2012","journal-title":"International conference for high-performance computing, networking, storage and analysis 
(SC)"},{"key":"bibr31-1094342014522573","volume-title":"Trust-Region Methods","author":"Conn AR","year":"1987"},{"key":"bibr32-1094342014522573","unstructured":"Daly J, Adolf B, Borkar S, (2012) Inter agency workshop on HPC resilience at extreme scale. Available at: http:\/\/institutes.lanl.gov\/resilience\/docs\/Inter-AgencyResilienceReport.pdf (accessed 25 February 2014)."},{"key":"bibr33-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2004.11.016"},{"key":"bibr34-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1145\/1327452.1327492"},{"key":"bibr35-1094342014522573","unstructured":"DeBardeleben N, Laros J, Daly J, (2010b) High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development. Technical Report LA-UR-10-00030, DARPA, VA. available at http:\/\/www.csm.ornl.gov\/~engelman\/publications\/debardeleben09high-end 2\/25\/14"},{"key":"bibr36-1094342014522573","unstructured":"DeHon A, Carter N, Quinn H (eds) (2011) Final report for CCC cross-layer reliability visioning study. 3 March Available at: http:\/\/xlayer.org\/FinalReport (accessed 25 February 2014)."},{"key":"bibr37-1094342014522573","first-page":"73","author":"Dimitrov M","year":"2007","journal-title":"Proceedings of the conference on parallel architecture and compilation techniques"},{"key":"bibr38-1094342014522573","unstructured":"Dixit A, Heald R, Wood A (2009) Trends from ten years of soft error experimentation. 
In: The workshop on silicon Available at: http:\/\/softerrors.info\/selse\/images\/selse_2009\/Papers\/selse5_submission_29.pdf (accessed 25 February 2014)."},{"key":"bibr39-1094342014522573","unstructured":"Dongarra J, Beckman P, Moore T, The international exascale software project roadmap International Journal of High Performance Computing Applications, 25(1), 3\u201360, 2011."},{"key":"bibr40-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1002\/j.1538-7305.1964.tb04118.x"},{"key":"bibr41-1094342014522573","first-page":"225","author":"Du P","year":"2012","journal-title":"Proceedings of the 17th ACM SIGPLAN symposium on principles and practice of parallel programming"},{"key":"bibr42-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1145\/568522.568525"},{"key":"bibr43-1094342014522573","unstructured":"Elnozahy (editor) System Resilience at Extreme Scale White Paper available at http:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?rep=rep1&type=pdf&doi=10.1.1.205.4240 accessed 2\/25\/14"},{"key":"bibr44-1094342014522573","unstructured":"EMC (2014) Smarts: Automated IT management enabling service assurance. Available at: http:\/\/www.emc.com\/it-management\/smarts\/index.htm (accessed 25 February 2014)."},{"key":"bibr45-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1016\/j.scico.2007.01.015"},{"key":"bibr46-1094342014522573","unstructured":"Fadden S (2012) An introduction to GPFS version 3.5. 
Available at: www-03.ibm.com\/systems\/jo\/resources\/introduction-to-gpfs-3-5.pdf (accessed 25 February 2014)."},{"key":"bibr47-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-45255-9_47"},{"key":"bibr48-1094342014522573","first-page":"385","author":"Feng S","year":"2010","journal-title":"Proceedings of the international conference on architectural support for programming languages and operating systems (ASPLOS)"},{"key":"bibr49-1094342014522573","author":"Ferreira KB","year":"2011","journal-title":"International conference for high-performance computing, networking, storage and analysis (SC)"},{"key":"bibr50-1094342014522573","unstructured":"Fletcher R (1981) Practical Methods of Optimization. Volume 2: Constrained Optimization. New York, NY: John Wiley & Sons."},{"key":"bibr51-1094342014522573","author":"Fujita H","year":"2013","journal-title":"Proceedings of the international conference on architectural support for programming languages and operating systems (ASPLOS)"},{"key":"bibr52-1094342014522573","author":"Gainaru A","year":"2012","journal-title":"Proceedings of the IEEE international parallel & distributed processing symposium (IPDPS)"},{"key":"bibr53-1094342014522573","first-page":"1","volume":"4","author":"Gainaru A","year":"2011","journal-title":"Proceedings of managing large-scale systems via the analysis of system logs and the application of machine learning techniques (SLAM\u201911)"},{"key":"bibr54-1094342014522573","author":"Gainaru A","year":"2012","journal-title":"International conference for high-performance computing, networking, storage and analysis (SC)"},{"key":"bibr55-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-23400-2_6"},{"key":"bibr56-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/LED.2010.2102002"},{"key":"bibr57-1094342014522573","author":"Gao Q","year":"2007","journal-title":"International conference for high-performance computing, networking, storage and analysis 
(SC)"},{"key":"bibr58-1094342014522573","author":"Gao Q","year":"2010","journal-title":"International conference for high-performance computing, networking, storage and analysis (SC)"},{"key":"bibr59-1094342014522573","first-page":"25","author":"Gattiker A","year":"1996","journal-title":"IEEE international workshop on IDDQ testing"},{"key":"bibr60-1094342014522573","unstructured":"Geist A, Lucas B, Snir M, (2012) U.S. Department of Energy fault management workshop. Technical report, U.S. Department of Energy, DC."},{"key":"bibr61-1094342014522573","first-page":"199","author":"Gill B","year":"2009","journal-title":"IEEE international reliability physics symposium"},{"key":"bibr62-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/DFTVS.2003.1250158"},{"key":"bibr63-1094342014522573","volume-title":"Automatic Differentiation of Algorithms: Theory, Implementation, and Application","author":"Griewank A","year":"1991"},{"key":"bibr64-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2007.55"},{"key":"bibr65-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.95"},{"key":"bibr66-1094342014522573","first-page":"1216","author":"Guermouche A","year":"2012","journal-title":"IEEE international parallel & distributed processing symposium (IPDPS)"},{"key":"bibr67-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2001.941390"},{"key":"bibr68-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-662-02427-0"},{"key":"bibr69-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1147\/rd.524.0413"},{"key":"bibr70-1094342014522573","volume-title":"Numerical Methods for Scientists and Engineers","author":"Hamming R","year":"1987"},{"key":"bibr71-1094342014522573","author":"Hangal S","year":"2002","journal-title":"Proceedings of the 2002 international conference on software 
engineering"},{"key":"bibr72-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/TEST.1993.470686"},{"key":"bibr73-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2012.6263960"},{"key":"bibr74-1094342014522573","author":"Hari SKS","year":"2012","journal-title":"Proceedings of the international conference on architectural support for programming languages and operating systems (ASPLOS)"},{"key":"bibr75-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1145\/1669112.1669129"},{"key":"bibr76-1094342014522573","first-page":"617","author":"Hazucha P","year":"2003","journal-title":"IEEE custom integrated circuits conference"},{"key":"bibr77-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/MSST.2005.22"},{"key":"bibr78-1094342014522573","author":"Heien E","year":"2011","journal-title":"International conference for high-performance computing, networking, storage and analysis (SC)"},{"key":"bibr79-1094342014522573","author":"Heiser G","year":"2011","journal-title":"13th workshop on hot topics in operating systems (HotOS)"},{"key":"bibr80-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRev.116.445"},{"key":"bibr81-1094342014522573","author":"Hogan S","year":"2012","journal-title":"2nd workshop on fault-tolerance for HPC at extreme scale (FTXS 2012)"},{"key":"bibr82-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1984.1676475"},{"key":"bibr83-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1108\/eb035229"},{"key":"bibr84-1094342014522573","first-page":"111","author":"Hwang AA","year":"2012","journal-title":"Proceedings of the international conference on architectural support for programming languages and operating systems 
(ASPLOS)"},{"key":"bibr85-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/TED.2010.2047907"},{"key":"bibr86-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2003.1160056"},{"key":"bibr87-1094342014522573","doi-asserted-by":"crossref","unstructured":"Katz DS, Daly J, DeBardeleben N, (2009) 2009 fault tolerance for extreme-scale computing workshop. Technical report ANL\/MCS-TM-312, Argonne National Laboratory, IL.","DOI":"10.2172\/971988"},{"key":"bibr88-1094342014522573","author":"Kerbyson D","year":"2012","journal-title":"Second international workshop on high-performance infrastructure for scalable tools"},{"issue":"3","key":"bibr89-1094342014522573","first-page":"508","volume":"14","author":"Kubota K","year":"1992","journal-title":"Journal of Information Processing"},{"key":"bibr90-1094342014522573","first-page":"679","author":"Kundu S","year":"2004","journal-title":"Proceedings of the international test conference (ITC)"},{"key":"bibr91-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1145\/2370816.2370848"},{"key":"bibr92-1094342014522573","author":"Laguna I","year":"2011","journal-title":"International conference for high-performance computing, networking, storage and analysis (SC)"},{"key":"bibr93-1094342014522573","first-page":"1","author":"Lange J","year":"2010","journal-title":"IEEE international symposium on parallel & distributed processing (IPDPS)"},{"key":"bibr94-1094342014522573","author":"Lee GL","year":"2007","journal-title":"International conference on parallel computing: Architectures, algorithms and applications (ParCo)"},{"key":"bibr95-1094342014522573","author":"Lee GL","year":"2008","journal-title":"International conference for high-performance computing, networking, storage and analysis (SC)"},{"key":"bibr96-1094342014522573","author":"Li ML","year":"2008","journal-title":"Proceedings of the IEEE\/IFIP international conference on dependable systems and networks 
(DSN)"},{"key":"bibr97-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1145\/1346281.1346315"},{"key":"bibr98-1094342014522573","doi-asserted-by":"crossref","first-page":"227","DOI":"10.1109\/ISPA.2008.125","author":"Lindekugel K","year":"2008","journal-title":"International symposium on parallel and distributed processing with applications (ISPA\u201908)"},{"key":"bibr99-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1023\/A:1021858008222"},{"key":"bibr100-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/12.278479"},{"key":"bibr101-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/ARITH.1989.72831"},{"key":"bibr102-1094342014522573","unstructured":"Los Alamos National Lab (2006) Operational data to support and enable computer science research. Available at: http:\/\/institutes.lanl.gov\/data\/fdata\/ (accessed 25 February 2014)."},{"key":"bibr103-1094342014522573","first-page":"821","author":"Louren\u00e7o J","year":"2001","journal-title":"International conference on computational science (ICCS)"},{"key":"bibr104-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1145\/2465813.2465821"},{"key":"bibr105-1094342014522573","author":"Lunardini D","year":"2004","journal-title":"2004 workshop on radiation effects on components and systems, radiation hardening techniques and new developments"},{"key":"bibr106-1094342014522573","first-page":"584","author":"Lyle G","year":"2009","journal-title":"Proceedings of the IEEE\/IFIP international conference on dependable systems and networks (DSN)"},{"key":"bibr107-1094342014522573","first-page":"1148","author":"Maxwell P","year":"2000","journal-title":"Proceedings of the international test conference (ITC)"},{"key":"bibr108-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.18"},{"key":"bibr109-1094342014522573","author":"Mirgorodskiy AV","year":"2006","journal-title":"International conference for high-performance computing, networking, storage and 
analysis (SC)"},{"key":"bibr110-1094342014522573","unstructured":"Mitchell R (1977) The Underground Grammarian, Vol., No. 1, January. Available at http:\/\/www.sourcetext.com\/grammarian\/ (accessed 25 February 2014)."},{"key":"bibr111-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/ICICDT.2007.4299587"},{"key":"bibr112-1094342014522573","unstructured":"Mokhtarani A, Kramer W, Hick J (2008) Reliability results of NERSC systems. https:\/\/publications.lbl.gov\/islandora\/object\/ir%3A150330 (accessed 25 February 2014)."},{"key":"bibr113-1094342014522573","author":"Moody A","year":"2010","journal-title":"International conference for high-performance computing, networking, storage and analysis (SC)"},{"key":"bibr114-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1145\/2168773.2168777"},{"key":"bibr115-1094342014522573","unstructured":"MPIPlugIn (2013) MPI plugin for KDevelop. Available at: http:\/\/sourceforge.net\/projects\/mpiplugin\/ (accessed 25 February 2014)."},{"key":"bibr116-1094342014522573","author":"Nakano J","year":"2006","journal-title":"Proceedings of the international symposium on high performance computer architecture (HPCA)"},{"key":"bibr117-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/ISPA.2010.82"},{"key":"bibr118-1094342014522573","first-page":"2F","author":"Nassif S","year":"2012","journal-title":"2012 IEEE international reliability physics symposium (IRPS)"},{"key":"bibr119-1094342014522573","unstructured":"NCAR (2014) Community earth system model. Available at: http:\/\/www2.cesm.ucar.edu\/ (accessed 25 February 2014)."},{"key":"bibr120-1094342014522573","unstructured":"Network Working Group (2009) The syslog protocol. 
Available at: http:\/\/tools.ietf.org\/html\/rfc5424 (accessed 25 February 2014)."},{"key":"bibr121-1094342014522573","first-page":"454","author":"Nigh P","year":"2000","journal-title":"Proceedings of the international test conference (ITC)"},{"key":"bibr122-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1016\/j.aap.2005.10.004"},{"key":"bibr123-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2007.103"},{"key":"bibr124-1094342014522573","first-page":"211","author":"Park Y","year":"2012","journal-title":"24th international symposium on computer architecture and high performance computing (SBAC-PAD)"},{"key":"bibr125-1094342014522573","author":"Pattabiraman K","year":"2008","journal-title":"Proceedings of the IEEE\/IFIP international conference on dependable systems and networks (DSN)"},{"key":"bibr126-1094342014522573","first-page":"97","author":"Pattabiraman K","year":"2006","journal-title":"European dependable computing conference"},{"key":"bibr127-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2002.1003567"},{"key":"bibr128-1094342014522573","first-page":"169","author":"Racunas P","year":"2007","journal-title":"Proceedings of the international symposium on high performance computer architecture (HPCA)"},{"key":"bibr129-1094342014522573","unstructured":"Ramachandran P (2011) Detecting and recovering from in-core hardware faults through software anomaly treatment. 
PhD Thesis, University of Illinois at Urbana Champaign, IL."},{"issue":"8","key":"bibr130-1094342014522573","first-page":"18","volume":"40","author":"Randall A V","year":"2006","journal-title":"Computerworld"},{"key":"bibr131-1094342014522573","volume-title":"Error Coding for Arithmetic Processors","author":"Rao TRN","year":"1974"},{"key":"bibr132-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1016\/j.microrel.2004.05.023"},{"key":"bibr133-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1145\/1113841.1113843"},{"key":"bibr134-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2005.34"},{"key":"bibr135-1094342014522573","unstructured":"Rogue Wave Software (2013) TotalView Debugger. Available at: http:\/\/www.roguewave.com\/products\/totalview.aspx (accessed 25 February 2014)."},{"key":"bibr136-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-23400-2_53"},{"key":"bibr137-1094342014522573","author":"Roth PC","year":"2003","journal-title":"International conference for high-performance computing, networking, storage and analysis (SC)"},{"key":"bibr138-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/12.494098"},{"key":"bibr139-1094342014522573","first-page":"70","author":"Sahoo S","year":"2008","journal-title":"Proceedings of the IEEE\/IFIP international conference on dependable systems and networks (DSN)"},{"key":"bibr140-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1145\/1670679.1670680"},{"key":"bibr141-1094342014522573","first-page":"2172","author":"Saxena N","year":"2002","journal-title":"IEEE international conference on systems, man, and cybernetics"},{"key":"bibr142-1094342014522573","first-page":"1","author":"Schroeder B","year":"2007","journal-title":"Proceedings of the 5th USENIX conference on file and storage technologies 
(FAST)"},{"key":"bibr143-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2009.4"},{"key":"bibr144-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1145\/1555349.1555372"},{"key":"bibr145-1094342014522573","doi-asserted-by":"crossref","unstructured":"Seltborg P, Polanski A, Petrochenkov S, (2005) Radiation shielding of high-energy neutrons in SAD. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 550(1): 313\u2013328.","DOI":"10.1016\/j.nima.2005.04.071"},{"key":"bibr146-1094342014522573","author":"Shipman G","year":"2010","journal-title":"The 52nd Cray user group conference"},{"key":"bibr147-1094342014522573","first-page":"1","author":"Slayman C","year":"2011","journal-title":"Proceedings of the annual reliability and maintainability symposium (RAMS)"},{"key":"bibr148-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/40.755464"},{"key":"bibr149-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1177\/1094342004048535"},{"key":"bibr150-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2002.1003568"},{"key":"bibr151-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1147\/rd.435.0863"},{"key":"bibr152-1094342014522573","author":"Sridharan V","year":"2012","journal-title":"International conference for high-performance computing, networking, storage and analysis (SC)"},{"key":"bibr153-1094342014522573","author":"Stearley J","year":"2005","journal-title":"Proceedings of the Linux clusters institute conference"},{"key":"bibr154-1094342014522573","volume-title":"The Black Swan: The Impact of the Highly Improbable","author":"Taleb N","year":"2010"},{"key":"bibr155-1094342014522573","volume-title":"Multigrid","author":"Trottenberg U","year":"2001"},{"key":"bibr156-1094342014522573","first-page":"107","author":"Turmon M","year":"2000","journal-title":"Proceedings of the IEEE\/IFIP international conference on dependable 
systems and networks (DSN)"},{"key":"bibr157-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2003.1197125"},{"key":"bibr158-1094342014522573","first-page":"8","author":"Van Horn J","year":"2005","journal-title":"Proceedings of the IEEE international test conference (ITC)"},{"key":"bibr159-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2006.40"},{"key":"bibr160-1094342014522573","unstructured":"Wittgenstein L (1953) Philosophical Investigations. The Macmillan Company, New York."},{"key":"bibr161-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1063\/1.3524521"},{"key":"bibr162-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1145\/361147.361115"},{"key":"bibr163-1094342014522573","first-page":"35","author":"Yu J","year":"2009","journal-title":"Proceedings of the 7th annual IEEE\/ACM international symposium on code generation and optimization"},{"key":"bibr164-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1063\/1.3679610"},{"key":"bibr165-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2006.887832"},{"key":"bibr166-1094342014522573","first-page":"1","author":"Zheng G","year":"2012","journal-title":"Proceedings of the IEEE\/IFIP international conference on dependable systems and networks (DSN)"},{"key":"bibr167-1094342014522573","first-page":"1","author":"Zhou J","year":"2010","journal-title":"17th IEEE international symposium on the physical and failure analysis of integrated circuits (IPFA)"},{"key":"bibr168-1094342014522573","doi-asserted-by":"publisher","DOI":"10.1016\/j.anucene.2010.01.017"}],"container-title":["The International Journal of High Performance Computing 
Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342014522573","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/1094342014522573","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342014522573","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,4]],"date-time":"2025-03-04T06:18:44Z","timestamp":1741069124000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342014522573"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,3,21]]},"references-count":168,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2014,5]]}},"alternative-id":["10.1177\/1094342014522573"],"URL":"https:\/\/doi.org\/10.1177\/1094342014522573","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2014,3,21]]}}}