{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T16:04:22Z","timestamp":1775837062064,"version":"3.50.1"},"reference-count":43,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2018,8,31]],"date-time":"2018-08-31T00:00:00Z","timestamp":1535673600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"DOE Office of Science User Facility","award":["DE-AC02-06CH11357"],"award-info":[{"award-number":["DE-AC02-06CH11357"]}]},{"DOI":"10.13039\/100000001","name":"NSF","doi-asserted-by":"publisher","award":["CCF-1336580, CNS-1350499, CNS-1526304, and CNS-1563956"],"award-info":[{"award-number":["CCF-1336580, CNS-1350499, CNS-1526304, and CNS-1563956"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Storage"],"published-print":{"date-parts":[[2018,8,31]]},"abstract":"<jats:p>Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types such as disk, SSD, CPU, memory, and network components can exhibit performance faults. We made several important observations such as faults convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.<\/jats:p>","DOI":"10.1145\/3242086","type":"journal-article","created":{"date-parts":[[2018,10,3]],"date-time":"2018-10-03T11:57:58Z","timestamp":1538567878000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":84,"title":["Fail-Slow at Scale"],"prefix":"10.1145","volume":"14","author":[{"given":"Haryadi S.","family":"Gunawi","sequence":"first","affiliation":[{"name":"University of Chicago"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2248-5961","authenticated-orcid":false,"given":"Riza O.","family":"Suminto","sequence":"additional","affiliation":[{"name":"University of Chicago"}]},{"given":"Russell","family":"Sears","sequence":"additional","affiliation":[{"name":"Pure Storage"}]},{"given":"Casey","family":"Golliher","sequence":"additional","affiliation":[{"name":"Pure Storage"}]},{"given":"Swaminathan","family":"Sundararaman","sequence":"additional","affiliation":[{"name":"Parallel Machines"}]},{"given":"Xing","family":"Lin","sequence":"additional","affiliation":[{"name":"NetApp"}]},{"given":"Tim","family":"Emami","sequence":"additional","affiliation":[{"name":"NetApp"}]},{"given":"Weiguang","family":"Sheng","sequence":"additional","affiliation":[{"name":"Huawei"}]},{"given":"Nematollah","family":"Bidokhti","sequence":"additional","affiliation":[{"name":"Huawei"}]},{"given":"Caitie","family":"McCaffrey","sequence":"additional","affiliation":[{"name":"Twitter"}]},{"given":"Deepthi","family":"Srinivasan","sequence":"additional","affiliation":[{"name":"Nutanix"}]},{"given":"Biswaranjan","family":"Panda","sequence":"additional","affiliation":[{"name":"Nutanix"}]},{"given":"Andrew","family":"Baptist","sequence":"additional","affiliation":[{"name":"IBM"}]},{"given":"Gary","family":"Grider","sequence":"additional","affiliation":[{"name":"Los Alamos National Laboratory"}]},{"given":"Parks M.","family":"Fields","sequence":"additional","affiliation":[{"name":"Los Alamos National Laboratory"}]},{"given":"Kevin","family":"Harms","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory"}]},{"given":"Robert B.","family":"Ross","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory"}]},{"given":"Andree","family":"Jacobson","sequence":"additional","affiliation":[{"name":"New Mexico Consortium"}]},{"given":"Robert","family":"Ricci","sequence":"additional","affiliation":[{"name":"University of Utah"}]},{"given":"Kirk","family":"Webb","sequence":"additional","affiliation":[{"name":"University of Utah"}]},{"given":"Peter","family":"Alvaro","sequence":"additional","affiliation":[{"name":"University of California, Santa Cruz"}]},{"given":"H. Birali","family":"Runesha","sequence":"additional","affiliation":[{"name":"University of Chicago Research Computing Center"}]},{"given":"Mingzhe","family":"Hao","sequence":"additional","affiliation":[{"name":"University of Chicago"}]},{"given":"Huaicheng","family":"Li","sequence":"additional","affiliation":[{"name":"University of Chicago"}]}],"member":"320","published-online":{"date-parts":[[2018,10,3]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2011. NAND Flash Media Management Through RAIN. Micron.  2011. NAND Flash Media Management Through RAIN. Micron."},{"key":"e_1_2_1_2_1","volume-title":"Open Hardware Monitor. Retrieved","year":"2017","unstructured":"2017. Open Hardware Monitor. Retrieved December 2017 from http:\/\/openhardwaremonitor.org. 2017. Open Hardware Monitor. Retrieved December 2017 from http:\/\/openhardwaremonitor.org."},{"key":"e_1_2_1_3_1","volume-title":"UCARE: Fail-Slow Database. Retrieved","year":"2018","unstructured":"2018. UCARE: Fail-Slow Database. Retrieved February 2018 from http:\/\/ucare.cs.uchicago.edu\/projects\/failslow\/. 2018. UCARE: Fail-Slow Database. Retrieved February 2018 from http:\/\/ucare.cs.uchicago.edu\/projects\/failslow\/."},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the 12th Symposium on Operating Systems Design and Implementation (OSDI\u201916)","author":"Alagappan Ramnatthan","unstructured":"Ramnatthan Alagappan , Aishwarya Ganesan , Yuvraj Patel , Thanumalayan Sankaranarayana Pillai , Andrea C. Arpaci-Dusseau , and Remzi H . Arpaci-Dusseau. 2016. Correlated crash vulnerabilities . In Proceedings of the 12th Symposium on Operating Systems Design and Implementation (OSDI\u201916) . Ramnatthan Alagappan, Aishwarya Ganesan, Yuvraj Patel, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2016. Correlated crash vulnerabilities. In Proceedings of the 12th Symposium on Operating Systems Design and Implementation (OSDI\u201916)."},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS VIII).","author":"Remzi","unstructured":"Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau. 2001. Fail-stutter fault tolerance . In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS VIII). Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau. 2001. Fail-stutter fault tolerance. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS VIII)."},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the 9th Symposium on Operating Systems Design and Implementation (OSDI\u201910)","author":"Attariyan Mona","year":"2010","unstructured":"Mona Attariyan and Jason Flinn . 2010 . Automating configuration troubleshooting with dynamic information flow analysis . In Proceedings of the 9th Symposium on Operating Systems Design and Implementation (OSDI\u201910) . Mona Attariyan and Jason Flinn. 2010. Automating configuration troubleshooting with dynamic information flow analysis. In Proceedings of the 9th Symposium on Operating Systems Design and Implementation (OSDI\u201910)."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1254882.1254917"},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of the 6th USENIX Symposium on File and Storage Technologies (FAST\u201908)","author":"Bairavasundaram Lakshmi N.","unstructured":"Lakshmi N. Bairavasundaram , Garth R. Goodson , Bianca Schroeder , Andrea C. Arpaci-Dusseau , and Remzi H . Arpaci-Dusseau. 2008. An analysis of data corruption in the storage stack . In Proceedings of the 6th USENIX Symposium on File and Storage Technologies (FAST\u201908) . Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2008. An analysis of data corruption in the storage stack. In Proceedings of the 6th USENIX Symposium on File and Storage Technologies (FAST\u201908)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TDMR.2005.853449"},{"key":"e_1_2_1_10_1","volume-title":"Proceedings of the 14th USENIX Symposium on File and Storage Technologies (FAST\u201916)","author":"Brewer Eric","year":"2016","unstructured":"Eric Brewer . 2016 . Spinning disks and their cloudy future (keynote) , In Proceedings of the 14th USENIX Symposium on File and Storage Technologies (FAST\u201916) . Eric Brewer. 2016. Spinning disks and their cloudy future (keynote), In Proceedings of the 14th USENIX Symposium on File and Storage Technologies (FAST\u201916)."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2015.49"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056062"},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the 9th Workshop on Hot Topics in Operating Systems (HotOS IX).","author":"Candea George","year":"2003","unstructured":"George Candea and Armando Fox . 2003 . Crash-only software . In Proceedings of the 9th Workshop on Hot Topics in Operating Systems (HotOS IX). George Candea and Armando Fox. 2003. Crash-only software. In Proceedings of the 9th Workshop on Hot Topics in Operating Systems (HotOS IX)."},{"key":"e_1_2_1_14_1","volume-title":"Proceedings of the Greenmetrics Workshop (Greenmetrics\u201913)","author":"Chan Christine S.","year":"2013","unstructured":"Christine S. Chan , Boxiang Pan , Kenny Gross , Kenny Gross , and Tajana Simunic Rosing . 2013 . Correcting vibration-induced performance degradation in enterprise servers . In Proceedings of the Greenmetrics Workshop (Greenmetrics\u201913) . Christine S. Chan, Boxiang Pan, Kenny Gross, Kenny Gross, and Tajana Simunic Rosing. 2013. Correcting vibration-induced performance degradation in enterprise servers. In Proceedings of the Greenmetrics Workshop (Greenmetrics\u201913)."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.5555\/1558977.1558988"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2670979.2670987"},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI\u201904)","author":"Dean Jeffrey","year":"2004","unstructured":"Jeffrey Dean and Sanjay Ghemawat . 2004 . MapReduce: Simplified data processing on large clusters . In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI\u201904) . Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI\u201904)."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2523616.2523627"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the 11th USENIX Symposium on File and Storage Technologies (FAST\u201913)","author":"Do Thanh","unstructured":"Thanh Do , Tyler Harter , Yingchao Liu , Haryadi S. Gunawi , Andrea C. Arpaci-Dusseau , and Remzi H . Arpaci-Dusseau. 2013. HARDFS: Hardening HDFS with selective and lightweight versioning . In Proceedings of the 11th USENIX Symposium on File and Storage Technologies (FAST\u201913) . Thanh Do, Tyler Harter, Yingchao Liu, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2013. HARDFS: Hardening HDFS with selective and lightweight versioning. In Proceedings of the 11th USENIX Symposium on File and Storage Technologies (FAST\u201913)."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2254756.2254778"},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the 15th USENIX Symposium on File and Storage Technologies (FAST\u201917)","author":"Ganesan Aishwarya","unstructured":"Aishwarya Ganesan , Ramnatthan Alagappan , Andrea C. Arpaci-Dusseau , and Remzi H . Arpaci-Dusseau. 2017. Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to single errors and corruptions . In Proceedings of the 15th USENIX Symposium on File and Storage Technologies (FAST\u201917) . Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2017. Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to single errors and corruptions. In Proceedings of the 15th USENIX Symposium on File and Storage Technologies (FAST\u201917)."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2670979.2670986"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2987550.2987583"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132747.3132774"},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of the 14th USENIX Symposium on File and Storage Technologies (FAST\u201916)","author":"Hao Mingzhe","unstructured":"Mingzhe Hao , Gokul Soundararajan , Deepak Kenchammana-Hosekote , Andrew A. Chien , and Haryadi S. Gunawi . 2016. The tail at store: A revelation from millions of hours of disk and SSD deployments . In Proceedings of the 14th USENIX Symposium on File and Storage Technologies (FAST\u201916) . Mingzhe Hao, Gokul Soundararajan, Deepak Kenchammana-Hosekote, Andrew A. Chien, and Haryadi S. Gunawi. 2016. The tail at store: A revelation from millions of hours of disk and SSD deployments. In Proceedings of the 14th USENIX Symposium on File and Storage Technologies (FAST\u201916)."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3102980.3103005"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/1629575.1629582"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.5555\/1855511.1855515"},{"key":"e_1_2_1_29_1","volume-title":"Proceedings of the 13th USENIX Symposium on File and Storage Technologies (FAST\u201915)","author":"Kim Jaeho","unstructured":"Jaeho Kim , Donghee Lee , and Sam H. Noh . 2015. Towards SLO complying SSDs through OPS isolation . In Proceedings of the 13th USENIX Symposium on File and Storage Technologies (FAST\u201915) . Jaeho Kim, Donghee Lee, and Sam H. Noh. 2015. Towards SLO complying SSDs through OPS isolation. In Proceedings of the 13th USENIX Symposium on File and Storage Technologies (FAST\u201915)."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2872362.2872374"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/2750482.2750501"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2745844.2745848"},{"key":"e_1_2_1_33_1","volume-title":"Proceedings of the 15th USENIX Symposium on File and Storage Technologies (FAST\u201917)","author":"Pillai Thanumalayan Sankaranarayana","unstructured":"Thanumalayan Sankaranarayana Pillai , Ramnatthan Alagappan , Lanyue Lu , Vijay Chidambaram , Andrea C. Arpaci-Dusseau , and Remzi H . Arpaci-Dusseau. 2017. Application crash consistency and performance with CCFS . In Proceedings of the 15th USENIX Symposium on File and Storage Technologies (FAST\u201917) . Thanumalayan Sankaranarayana Pillai, Ramnatthan Alagappan, Lanyue Lu, Vijay Chidambaram, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2017. Application crash consistency and performance with CCFS. In Proceedings of the 15th USENIX Symposium on File and Storage Technologies (FAST\u201917)."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/1095810.1095830"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.5555\/1855511.1855517"},{"key":"e_1_2_1_36_1","volume-title":"Proceedings of the 5th USENIX Symposium on File and Storage Technologies (FAST\u201907)","author":"Schroeder Bianca","unstructured":"Bianca Schroeder and Garth A. Gibson . 2007. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX Symposium on File and Storage Technologies (FAST\u201907) . Bianca Schroeder and Garth A. Gibson. 2007. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX Symposium on File and Storage Technologies (FAST\u201907)."},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the 14th USENIX Symposium on File and Storage Technologies (FAST\u201916)","author":"Schroeder Bianca","year":"2016","unstructured":"Bianca Schroeder , Raghav Lagisetty , and Arif Merchant . 2016 . Flash reliability in production: The expected and the unexpected . In Proceedings of the 14th USENIX Symposium on File and Storage Technologies (FAST\u201916) . Bianca Schroeder, Raghav Lagisetty, and Arif Merchant. 2016. Flash reliability in production: The expected and the unexpected. In Proceedings of the 14th USENIX Symposium on File and Storage Technologies (FAST\u201916)."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/1555349.1555372"},{"key":"e_1_2_1_39_1","first-page":"9","article-title":"Hard disk drive reliability modeling and failure prediction","volume":"43","author":"Strom Brian D.","year":"2007","unstructured":"Brian D. Strom , SungChang Lee , George W. Tyndall , and Andrei Khurshudov . 2007 . Hard disk drive reliability modeling and failure prediction . IEEE Transactions on Magnetics (TMAG) 43 , 9 (September 2007). Brian D. Strom, SungChang Lee, George W. Tyndall, and Andrei Khurshudov. 2007. Hard disk drive reliability modeling and failure prediction. IEEE Transactions on Magnetics (TMAG) 43, 9 (September 2007).","journal-title":"IEEE Transactions on Magnetics (TMAG)"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3127479.3131622"},{"key":"e_1_2_1_41_1","volume-title":"Proceedings of the International Conference on Computing, Networking and Communications (ICNC\u201912)","author":"Yaakobi Eitan","unstructured":"Eitan Yaakobi , Laura Grupp , Paul H. Siegel , Steven Swanson , and Jack K. Wolf . 2012. Characterization and error-correcting codes for TLC flash memories . In Proceedings of the International Conference on Computing, Networking and Communications (ICNC\u201912) . Eitan Yaakobi, Laura Grupp, Paul H. Siegel, Steven Swanson, and Jack K. Wolf. 2012. Characterization and error-correcting codes for TLC flash memories. In Proceedings of the International Conference on Computing, Networking and Communications (ICNC\u201912)."},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of the 15th USENIX Symposium on File and Storage Technologies (FAST\u201917)","author":"Yan Shiqin","unstructured":"Shiqin Yan , Huaicheng Li , Mingzhe Hao , Michael Hao Tong , Swaminathan Sundararaman , Andrew A. Chien , and Haryadi S. Gunawi . 2017. Tiny-tail flash: Near-perfect elimination of garbage collection tail latencies in NAND SSDs . In Proceedings of the 15th USENIX Symposium on File and Storage Technologies (FAST\u201917) . Shiqin Yan, Huaicheng Li, Mingzhe Hao, Michael Hao Tong, Swaminathan Sundararaman, Andrew A. Chien, and Haryadi S. Gunawi. 2017. Tiny-tail flash: Near-perfect elimination of garbage collection tail latencies in NAND SSDs. In Proceedings of the 15th USENIX Symposium on File and Storage Technologies (FAST\u201917)."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2043556.2043572"}],"container-title":["ACM Transactions on Storage"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3242086","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3242086","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3242086","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:43:35Z","timestamp":1750207415000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3242086"}},"subtitle":["Evidence of Hardware Performance Faults in Large Production Systems"],"short-title":[],"issued":{"date-parts":[[2018,8,31]]},"references-count":43,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2018,8,31]]}},"alternative-id":["10.1145\/3242086"],"URL":"https:\/\/doi.org\/10.1145\/3242086","relation":{},"ISSN":["1553-3077","1553-3093"],"issn-type":[{"value":"1553-3077","type":"print"},{"value":"1553-3093","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,8,31]]},"assertion":[{"value":"2018-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-10-03","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}