{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T17:02:08Z","timestamp":1765040528730,"version":"3.41.0"},"reference-count":47,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2014,10,27]],"date-time":"2014-10-27T00:00:00Z","timestamp":1414368000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["NSF CCF 0541383 and CCF 0811693"],"award-info":[{"award-number":["NSF CCF 0541383 and CCF 0811693"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"name":"OpenSPARC Center of Excellence at Illinois supported by Sun Microsystems"},{"DOI":"10.13039\/100002418","name":"Intel Corporation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100002418","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100004316","name":"International Business Machines Corporation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100004316","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Gigascale Systems Research Center"},{"DOI":"10.13039\/100000143","name":"Division of Computing and Communication Foundations","doi-asserted-by":"publisher","award":["NSF CCF 0541383 and CCF 0811693"],"award-info":[{"award-number":["NSF CCF 0541383 and CCF 0811693"]}],"id":[{"id":"10.13039\/100000143","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2014,10,27]]},"abstract":"<jats:p>With continued process scaling, the rate of hardware failures in commodity systems is increasing. Because these commodity systems are highly sensitive to cost, traditional solutions that employ heavy redundancy to handle such failures are no longer acceptable owing to their high associated costs.<\/jats:p>\n          <jats:p>Detecting such faults by identifying anomalous software execution and recovering through checkpoint-and-replay is emerging as a viable low-cost alternative for future commodity systems. An important but commonly ignored aspect of such solutions is ensuring that external outputs to the system are fault-free. The outputs must be delayed until the detectors guarantee this, influencing fault-free performance. The overheads for resiliency must thus be evaluated while taking these delays into consideration; prior work has largely ignored this relationship.<\/jats:p>\n          <jats:p>This article concerns recovery for I\/O intensive applications from in-core faults. We present a strategy to buffer external outputs using dedicated hardware and show that checkpoint intervals previously considered as acceptable incur exorbitant overheads when hardware buffering is considered. We then present two techniques to reduce the checkpoint interval and demonstrate a practical solution that provides high resiliency while incurring low overheads.<\/jats:p>","DOI":"10.1145\/2656342","type":"journal-article","created":{"date-parts":[[2014,10,28]],"date-time":"2014-10-28T12:40:29Z","timestamp":1414500029000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["Hardware Fault Recovery for I\/O Intensive Applications"],"prefix":"10.1145","volume":"11","author":[{"given":"Pradeep","family":"Ramachandran","sequence":"first","affiliation":[{"name":"Intel Corporation, Sarjapur Outner Ring Road, Bangalore"}]},{"given":"Siva Kumar Sastry","family":"Hari","sequence":"additional","affiliation":[{"name":"NVIDIA, San Tomas Expy, Santa Clara, CA"}]},{"given":"Manlap","family":"Li","sequence":"additional","affiliation":[{"name":"Latham and Watkins LLP, San Francisco, CA"}]},{"given":"Sarita V.","family":"Adve","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana Champaign, Urbana, Illinois"}]}],"member":"320","published-online":{"date-parts":[[2014,10,27]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2008.30"},{"key":"e_1_2_1_2_1","volume-title":"Austin","author":"Todd","year":"1998","unstructured":"Todd M. Austin . 1998 . DIVA : A Reliable Substrate for Deep Submicron Microarchitecture Design. In MICRO. 196--207. Todd M. Austin. 1998. DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. In MICRO. 196--207."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2005.70"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2005.110"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1250727.1250728"},{"key":"e_1_2_1_6_1","volume-title":"Compiler-Enhanced Incremental Checkpointing. In Workshop on Languages and Compilers for Parallel Computing.","author":"Bronevetsky Greg","year":"2008","unstructured":"Greg Bronevetsky , Daniel Marques , Keshav Pingali , and Radu Rugina . 2008 . Compiler-Enhanced Incremental Checkpointing. In Workshop on Languages and Compilers for Parallel Computing. Greg Bronevetsky, Daniel Marques, Keshav Pingali, and Radu Rugina. 2008. Compiler-Enhanced Incremental Checkpointing. In Workshop on Languages and Compilers for Parallel Computing."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2006.15"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.39"},{"key":"e_1_2_1_9_1","volume-title":"Linux Device Drivers","author":"Corbet Jonathan","unstructured":"Jonathan Corbet , Greg Kroah-Hartman , and Alessandro Rubini . 2005. Linux Device Drivers ( 3 rd ed.). O\u2019Reilly . Jonathan Corbet, Greg Kroah-Hartman, and Alessandro Rubini. 2005. Linux Device Drivers (3rd ed.). O\u2019Reilly.","edition":"3"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1816026"},{"key":"e_1_2_1_11_1","unstructured":"Marc de Kruijf and Karhikeyan Sankaralingam. 2009. Exploring the Synergy of Emerging Workloads and Si Reliability Trends. In SELSE.  Marc de Kruijf and Karhikeyan Sankaralingam. 2009. Exploring the Synergy of Emerging Workloads and Si Reliability Trends. In SELSE."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1346281.1346295"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1133255.1133999"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.5555\/1299042.1299113"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.scico.2007.01.015"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1736020.1736063"},{"key":"e_1_2_1_17_1","unstructured":"Siva Hari Man-Lap Li P. Ramachandran Byn Choi and S. V. Adve. 2009. Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems. In MICRO.  Siva Hari Man-Lap Li P. Ramachandran Byn Choi and S. V. Adve. 2009. Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems. In MICRO."},{"key":"e_1_2_1_18_1","unstructured":"Siva Kumar Sastry Hari Sarita V. Adve and Helia Naeimi. 2012. Low-Cost Program-Level Detectors for Reducing Silent Data Corruptions. In DSN.  Siva Kumar Sastry Hari Sarita V. Adve and Helia Naeimi. 2012. Low-Cost Program-Level Detectors for Reducing Silent Data Corruptions. In DSN."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1629575.1629582"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1508244.1508251"},{"key":"e_1_2_1_23_1","unstructured":"Manlap Li Pradeep Ramachandran Swarup Sahoo Sarita Adve Vikram Adve and Yuanyuan Zhou. 2008a. Trace-Based Microarchitecture-Level Diagnosis of Permanent Hardware Faults. In DSN.  Manlap Li Pradeep Ramachandran Swarup Sahoo Sarita Adve Vikram Adve and Yuanyuan Zhou. 2008a. Trace-Based Microarchitecture-Level Diagnosis of Permanent Hardware Faults. In DSN."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1346281.1346315"},{"key":"e_1_2_1_25_1","volume-title":"Siva Hari, and Sarita Adve.","author":"Li Manlap","year":"2009","unstructured":"Manlap Li , Pradeep Ramachandran , Rahmet Ulya Karpuzcu , Siva Hari, and Sarita Adve. 2009 . Accurate Microarchitecture-Level Fault Modeling for Studying Hardware Faults. In HPCA. Manlap Li, Pradeep Ramachandran, Rahmet Ulya Karpuzcu, Siva Hari, and Sarita Adve. 2009. Accurate Microarchitecture-Level Fault Modeling for Studying Hardware Faults. In HPCA."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2007.346196"},{"key":"e_1_2_1_27_1","doi-asserted-by":"crossref","unstructured":"Peter B. Mark. 1985. The Sequoia Computer: A Fault-Tolerant Tightly-Coupled Multiprocessor Architecture. In ISCA.   Peter B. Mark. 1985. The Sequoia Computer: A Fault-Tolerant Tightly-Coupled Multiprocessor Architecture. In ISCA.","DOI":"10.1145\/327070.327218"},{"key":"e_1_2_1_28_1","unstructured":"Yoshio Masubuchi Satoshi Hoshina Tomofumi Shimada Hideaki Hirayama and Nobuhiro Kato. 1997. Fault Recovery Mechanism for Multiprocessor Servers. In FTCS.   Yoshio Masubuchi Satoshi Hoshina Tomofumi Shimada Hideaki Hirayama and Nobuhiro Kato. 1997. Fault Recovery Mechanism for Multiprocessor Servers. In FTCS."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.8"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2005.37"},{"key":"e_1_2_1_31_1","unstructured":"Shubhendu S. Mukherjee Christopher Weaver Joel Emer Steven K. Reinhardt and Todd Austin. 2003. A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor. In MICRO.   Shubhendu S. Mukherjee Christopher Weaver Joel Emer Steven K. Reinhardt and Todd Austin. 2003. A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor. In MICRO."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/1542476.1542504"},{"key":"e_1_2_1_33_1","unstructured":"Jun Nakano Pablo Montesinos Kourosh Gharachorloo and Josep Torrellas. 2006. ReVive I\/O: Efficient Handling of I\/O in Highly-Available Rollback-Recovery Servers. In HPCA.  Jun Nakano Pablo Montesinos Kourosh Gharachorloo and Josep Torrellas. 2006. ReVive I\/O: Efficient Handling of I\/O in Highly-Available Rollback-Recovery Servers. In HPCA."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/11408901_8"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000089"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/EDCC.2006.9"},{"key":"e_1_2_1_37_1","doi-asserted-by":"crossref","unstructured":"A. Pellegrini R. Smolinski X. Fu L. Chen S. K. S. Hari J. Jiang S. V. Adve T. Austin and V. Bertacco. 2012. CrashTest\u2019ing SWAT: Accurate Gate-Level Evaluation of Symptom-Based Resiliency Solutions. In Design Automation and Test Europe.   A. Pellegrini R. Smolinski X. Fu L. Chen S. K. S. Hari J. Jiang S. V. Adve T. Austin and V. Bertacco. 2012. CrashTest\u2019ing SWAT: Accurate Gate-Level Evaluation of Symptom-Based Resiliency Solutions. In Design Automation and Test Europe.","DOI":"10.1109\/DATE.2012.6176660"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/1555754.1555769"},{"key":"e_1_2_1_39_1","doi-asserted-by":"crossref","unstructured":"Milos Prvulovic Zheng Zhang and Josep Torrellas. 2002. ReVive: Cost-Effective Arch Support for Rollback Recovery in Shared-Mem Multiprocessors. In ISCA.   Milos Prvulovic Zheng Zhang and Josep Torrellas. 2002. ReVive: Cost-Effective Arch Support for Rollback Recovery in Shared-Mem Multiprocessors. In ISCA.","DOI":"10.1145\/545214.545228"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2007.346195"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1113841.1113843"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/1454115.1454124"},{"key":"e_1_2_1_43_1","doi-asserted-by":"crossref","unstructured":"Swarup Sahoo Man-Lap Li P. Ramachandran S. V. Adve V. S. Adve and Yuanyuan Zhou. 2008. Using Likely Program Invariants to Detect Hardware Errors. In DSN.  Swarup Sahoo Man-Lap Li P. Ramachandran S. V. Adve V. S. Adve and Yuanyuan Zhou. 2008. Using Likely Program Invariants to Detect Hardware Errors. In DSN.","DOI":"10.1109\/DSN.2008.4630072"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/1024393.1024420"},{"key":"e_1_2_1_45_1","doi-asserted-by":"crossref","unstructured":"Daniel Sorin Milo Martin Mark Hill and David Wood. 2002. SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint\/Recovery. In ISCA.   Daniel Sorin Milo Martin Mark Hill and David Wood. 2002. SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint\/Recovery. In ISCA.","DOI":"10.1145\/545214.545229"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1147\/rd.435.0863"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/1366224.1366225"},{"key":"e_1_2_1_48_1","unstructured":"Michael Swift Muthukaruppan Annamalai Brian Bershad and Henry Levy. 2004. Recovering Device Drivers. In OSDI.   Michael Swift Muthukaruppan Annamalai Brian Bershad and Henry Levy. 2004. Recovering Device Drivers. In OSDI."},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2006.40"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2656342","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2656342","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T07:19:37Z","timestamp":1750231177000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2656342"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,10,27]]},"references-count":47,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2014,10,27]]}},"alternative-id":["10.1145\/2656342"],"URL":"https:\/\/doi.org\/10.1145\/2656342","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2014,10,27]]},"assertion":[{"value":"2013-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2014-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2014-10-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}