{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,30]],"date-time":"2025-08-30T00:06:32Z","timestamp":1756512392184,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":21,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,5,14]]},"DOI":"10.1145\/3713082.3730380","type":"proceedings-article","created":{"date-parts":[[2025,8,29]],"date-time":"2025-08-29T16:47:25Z","timestamp":1756486045000},"page":"172-178","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Analyzing Metastable Failures"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-6737-1503","authenticated-orcid":false,"given":"Rebecca","family":"Isaacs","sequence":"first","affiliation":[{"name":"AWS, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6672-240X","authenticated-orcid":false,"given":"Peter","family":"Alvaro","sequence":"additional","affiliation":[{"name":"UC Santa Cruz and AWS, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2136-0542","authenticated-orcid":false,"given":"Rupak","family":"Majumdar","sequence":"additional","affiliation":[{"name":"MPI-SWS and AWS, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-2291-9485","authenticated-orcid":false,"given":"Kiran-Kumar","family":"Muniswamy-Reddy","sequence":"additional","affiliation":[{"name":"AWS, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3790-3935","authenticated-orcid":false,"given":"Mahmoud","family":"Salamati","sequence":"additional","affiliation":[{"name":"MPI-SWS, 
Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1922-6678","authenticated-orcid":false,"given":"Sadegh","family":"Soudjani","sequence":"additional","affiliation":[{"name":"MPI-SWS, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,6,6]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"API ChatGPT & Sora facing issues. https:\/\/status.openai.com\/incidents\/ctrsv3lwd797."},{"key":"e_1_3_2_1_2_1","unstructured":"Summary of the Amazon Kinesis Data Streams service event in Northern Virginia (US-EAST-1) region. https:\/\/aws.amazon.com\/message\/073024\/."},{"key":"e_1_3_2_1_3_1","volume-title":"Artalejo and Antonio G\u00f3mez-Corral. Retrial queueing systems: A computational approach","author":"Jes\u00fas","year":"2008","unstructured":"Jes\u00fas R. Artalejo and Antonio G\u00f3mez-Corral. Retrial queueing systems: A computational approach. Springer, 2008."},{"key":"e_1_3_2_1_4_1","volume-title":"Metastability and low lying spectra in reversible Markov chains. Commun. Math. Phys., (228):219--255","author":"Bovier Anton","year":"2002","unstructured":"Anton Bovier, Michael Eckhoff, Veronique Gayrard, and Markus Klein. Metastability and low lying spectra in reversible Markov chains. Commun. Math. Phys., (228):219--255, 2002."},{"key":"e_1_3_2_1_5_1","first-page":"1","article-title":"Recommendations on queue management and congestion avoidance in the internet","volume":"2309","author":"Braden Bob","year":"1998","unstructured":"Bob Braden, David D. Clark, Jon Crowcroft, Bruce S. Davie, Steve Deering, Deborah Estrin, Sally Floyd, Van Jacobson, Greg Minshall, Craig Partridge, Larry L. Peterson, K. K. Ramakrishnan, Scott Shenker, John Wroclawski, and Lixia Zhang. Recommendations on queue management and congestion avoidance in the internet. 
RFC, 2309:1--17, 1998.","journal-title":"RFC"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458336.3465286"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3689778"},{"key":"e_1_3_2_1_8_1","volume-title":"Using STAMP to improve resilience in Google production systems. In","author":"Falzone Tim","year":"2024","unstructured":"Tim Falzone and Ben Treynor Sloss. Using STAMP to improve resilience in Google production systems. In ;login:, 2024."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4684-0176-9"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/SRDS64841.2024.00013"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9781139226424"},{"key":"e_1_3_2_1_12_1","first-page":"73","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Huang Lexiang","year":"2022","unstructured":"Lexiang Huang, Matthew Magnusson, Abishek Bangalore Muralikrishna, Salman Estyak, Rebecca Isaacs, Abutalib Aghayev, Timothy Zhu, and Aleksey Charapko. Metastable failures in the wild. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 73--90, Carlsbad, CA, July 2022. USENIX Association."},{"key":"e_1_3_2_1_13_1","volume-title":"Measurement, Simulation, and Modeling","author":"Jain Raj","year":"1991","unstructured":"Raj Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. 
Wiley-Interscience, 1991."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2838344.2839461"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASE56229.2023.00032"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.14778\/3681954.3681980"},{"key":"e_1_3_2_1_17_1","first-page":"127","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022","author":"Sethi Utsav","year":"2022","unstructured":"Utsav Sethi, Haochen Pan, Shan Lu, Madanlal Musuvathi, and Suman Nath. Cancellation in systems: An empirical study of task cancellation patterns and failures. In Marcos K. Aguilera and Hakim Weatherspoon, editors, 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, pages 127--141. USENIX Association, 2022."},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3694715.3695971"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3552326.3587448"},{"key":"e_1_3_2_1_20_1","first-page":"1","volume-title":"Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware, LADIS '08","author":"van Renesse Robbert","year":"2008","unstructured":"Robbert van Renesse, Rodrigo Rodrigues, Mike Spreitzer, Christopher Stewart, Doug Terry, and Franco Travostino. Challenges facing tomorrow's datacenter: summary of the LADiS workshop. In Eliezer Dekel and Gregory V. Chockler, editors, Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware, LADIS '08, Yorktown Heights, New York, USA, September 15-17, 2008, pages 1:1--1:7. ACM, 2008."},{"key":"e_1_3_2_1_21_1","first-page":"249","volume-title":"11th USENIX Symposium on Operating Systems Design and Implementation, OSDI '14","author":"Yuan Ding","year":"2014","unstructured":"Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay Jain, and Michael Stumm. 
Simple testing can prevent most critical failures: An analysis of production failures in distributed data-intensive systems. In Jason Flinn and Hank Levy, editors, 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI '14, Broomfield, CO, USA, October 6-8, 2014, pages 249--265. USENIX Association, 2014."}],"event":{"name":"HOTOS '25: Workshop on Hot Topics in Operating Systems","location":"Banff AB Canada","acronym":"HOTOS '25","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems"]},"container-title":["Proceedings of the Workshop on Hot Topics in Operating Systems"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3713082.3730380","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,29]],"date-time":"2025-08-29T16:49:03Z","timestamp":1756486143000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3713082.3730380"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,14]]},"references-count":21,"alternative-id":["10.1145\/3713082.3730380","10.1145\/3713082"],"URL":"https:\/\/doi.org\/10.1145\/3713082.3730380","relation":{},"subject":[],"published":{"date-parts":[[2025,5,14]]},"assertion":[{"value":"2025-06-06","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}