{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T23:01:58Z","timestamp":1777676518347,"version":"3.51.4"},"reference-count":35,"publisher":"SAGE Publications","issue":"3","license":[{"start":{"date-parts":[[2013,7,3]],"date-time":"2013-07-03T00:00:00Z","timestamp":1372809600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2013,8]]},"abstract":"<jats:p>As large-scale systems evolve towards post-petascale computing, it is crucial to focus on providing fault-tolerance strategies that aim to minimize the effects of faults on applications. By far the most popular technique is the checkpoint\u2013restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and proactive measures are taken. This requires a reliable prediction system to anticipate failures and their locations. One way of offering prediction is by the analysis of system logs generated during production by large-scale systems. Current approaches in this field have a number of limitations that make them unusable on real production high-performance computing (HPC) systems. Based on our observations that different failures have different distributions and behaviours, we propose a novel hybrid approach that combines signal analysis with data mining in order to overcome current limitations. 
We show that by analysing each event according to its specific behaviour, our prediction provides a precision of over 90% and is able to discover about 50% of all failures in a system, a result that allows its integration into proactive fault tolerance protocols.<\/jats:p>","DOI":"10.1177\/1094342013488258","type":"journal-article","created":{"date-parts":[[2013,7,4]],"date-time":"2013-07-04T20:16:31Z","timestamp":1372968991000},"page":"273-282","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":25,"title":["Failure prediction for HPC systems and applications"],"prefix":"10.1177","volume":"27","author":[{"given":"Ana","family":"Gainaru","sequence":"first","affiliation":[{"name":"National Centre for Supercomputing Applications, Urbana, IL, USA"},{"name":"University of Illinois at Urbana-Champaign, Urbana, IL, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Franck","family":"Cappello","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, Urbana, IL, USA"},{"name":"INRIA, Rocquencourt, Le Chesnay Cedex, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Marc","family":"Snir","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, Urbana, IL, USA"},{"name":"Argonne National Laboratory, Argonne, IL, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"William","family":"Kramer","sequence":"additional","affiliation":[{"name":"National Centre for Supercomputing Applications, Urbana, IL, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","published-online":{"date-parts":[[2013,7,3]]},"reference":[{"key":"bibr1-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2008.5214725"},{"key":"bibr2-1094342013488258","volume-title":"Impact of Fault Prediction on Checkpointing Strategies","author":"Aupy 
G","year":"2012"},{"key":"bibr3-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063427"},{"key":"bibr4-1094342013488258","volume-title":"Conference of the Prognostics and Health Management Society","author":"Bolander N","year":"2009"},{"key":"bibr5-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2013.74"},{"key":"bibr6-1094342013488258","first-page":"23","volume-title":"Symposium on Networked Systems Design and Implementation","volume":"1","author":"Chen MY","year":"2004"},{"key":"bibr7-1094342013488258","first-page":"1","volume-title":"International Conference on Dependable System and Networks","author":"DiMartino C","year":"2012"},{"key":"bibr8-1094342013488258","first-page":"71","volume-title":"Handbook of Software Reliability Engineering","author":"Farr W","year":"1996"},{"key":"bibr9-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2012.107"},{"key":"bibr10-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2012.57"},{"key":"bibr11-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-23400-2_6"},{"key":"bibr12-1094342013488258","volume-title":"Reliability Theory: With Applications to Preventive Maintenance","author":"Gertsbakh I","year":"2000"},{"key":"bibr13-1094342013488258","first-page":"630","volume":"6","author":"Gu J","year":"2010","journal-title":"Journal of Parallel and Distributed 
Computing"},{"key":"bibr14-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063444"},{"key":"bibr15-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1145\/2189750.2150989"},{"key":"bibr16-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2011.52"},{"key":"bibr17-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2006.18"},{"key":"bibr18-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1145\/1740390.1740411"},{"key":"bibr19-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.310"},{"key":"bibr20-1094342013488258","first-page":"160","author":"Nassar FA","year":"1985","journal-title":"IEEE Real-Time Systems Symposium"},{"key":"bibr21-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1142\/S0218539310003731"},{"key":"bibr22-1094342013488258","author":"Rajachandrasekar R","year":"2012","journal-title":"International Parallel and Distributed Processing Symposium Workshops"},{"key":"bibr23-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1145\/1670679.1670680"},{"key":"bibr24-1094342013488258","first-page":"161","author":"Salfner F","year":"2007","journal-title":"Symposium on Reliable Distributed Systems"},{"key":"bibr25-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2009.4"},{"key":"bibr26-1094342013488258","unstructured":"Snir M, Gropp W, Kogge P (2011) Exascale research: preparing for the post Moore era. Computer Science Whitepapers. 
Available at: https:\/\/www.ideals.illinois.edu\/bitstream\/handle\/2142\/25469\/Exascale%20Research.pdf?sequence=2"},{"key":"bibr27-1094342013488258","first-page":"155","volume-title":"Workshop on Managing System Automatically and Dynamically","volume":"2","author":"Stearley J","year":"2012"},{"key":"bibr28-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1147\/sj.413.0461"},{"key":"bibr29-1094342013488258","first-page":"96","author":"Wang C","year":"2010","journal-title":"Network Operations and Management Symposium"},{"key":"bibr30-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2009.19"},{"key":"bibr31-1094342013488258","first-page":"259","volume-title":"IEEE Conference on Dependable Systems and Networks Workshops","author":"Yu L","year":"2011"},{"key":"bibr32-1094342013488258","first-page":"93","volume-title":"International Conference on Cluster Computing CLUSTER","author":"Zheng GB","year":"2004"},{"key":"bibr33-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1109\/DSNW.2010.5542627"},{"key":"bibr34-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTR.2007.4629246"},{"key":"bibr35-1094342013488258","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.83"}],"container-title":["The International Journal of High Performance Computing 
Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342013488258","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/1094342013488258","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342013488258","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T08:19:13Z","timestamp":1777450753000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342013488258"}},"subtitle":["Current situation and open issues"],"short-title":[],"issued":{"date-parts":[[2013,7,3]]},"references-count":35,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2013,8]]}},"alternative-id":["10.1177\/1094342013488258"],"URL":"https:\/\/doi.org\/10.1177\/1094342013488258","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2013,7,3]]}}}