{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,6]],"date-time":"2026-06-06T01:13:18Z","timestamp":1780708398485,"version":"3.54.1"},"publisher-location":"New York, NY, USA","reference-count":115,"publisher":"ACM","license":[{"start":{"date-parts":[[2024,11,20]],"date-time":"2024-11-20T00:00:00Z","timestamp":1732060800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"NSF","award":["2140305"],"award-info":[{"award-number":["2140305"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,11,20]]},"DOI":"10.1145\/3698038.3698568","type":"proceedings-article","created":{"date-parts":[[2024,11,14]],"date-time":"2024-11-14T06:32:43Z","timestamp":1731565963000},"page":"341-360","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Demystifying the Fight Against Complexity: A Comprehensive Study of Live Debugging Activities in Production Cloud Systems"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-3860-9573","authenticated-orcid":false,"given":"P. C.","family":"Sruthi","sequence":"first","affiliation":[{"name":"Purdue University, West Lafayette, Indiana, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-0236-3959","authenticated-orcid":false,"given":"Zinan","family":"Guo","sequence":"additional","affiliation":[{"name":"Purdue University, West Lafayette, Indiana, USA and Ernst &amp; Young"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-8696-9036","authenticated-orcid":false,"given":"Deming","family":"Chu","sequence":"additional","affiliation":[{"name":"Purdue University, West Lafayette, Indiana, USA and Tongji University, interning at Purdue"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-6012-8312","authenticated-orcid":false,"given":"Zhengyan","family":"Chen","sequence":"additional","affiliation":[{"name":"University of Georgia, Athens, Georgia, USA and Peking University, interning at Purdue"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5350-5182","authenticated-orcid":false,"given":"Yongle","family":"Zhang","sequence":"additional","affiliation":[{"name":"Purdue University, West Lafayette, Indiana, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2024,11,20]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2016. NameNode RepicationMonitor Exception Tracking- Hexiaoqiao. Retrieved 2022-12-10 from https:\/\/hexiaoqiao.github.io\/blog\/2016\/09\/13\/namenode-repicationmonitor-exception-trace\/"},{"key":"e_1_3_2_1_2_1","unstructured":"2016. Remember a DataNode Slow Start Problem - Blog of People on the Android Road - CSDN Blog. Retrieved 2023-04-12 from https:\/\/blog.csdn.net\/Androidlushangderen\/article\/details\/50500136"},{"key":"e_1_3_2_1_3_1","unstructured":"2017. It Must Be All Your Fault! - A Troubleshooting Experience of FastDFS Concurrency Problems- Pure Smile- Blog Garden. Retrieved 2023-09-04 from https:\/\/www.cnblogs.com\/ityouknow\/p\/8123998.html"},{"key":"e_1_3_2_1_4_1","unstructured":"2019. k8s|A Troubleshooting-Tencent Cloud Developer Community-Tencent Cloud. Retrieved 2022-12-09 from https:\/\/cloud.tencent.com\/developer\/article\/1444074"},{"key":"e_1_3_2_1_5_1","unstructured":"2019. The Task Submitted by Hive to Yarn Has Been Running Troubleshooting_hive on Spark Task Running Yarn Ui Shows Running. Retrieved 2023-04-17 from https:\/\/blog.csdn.net\/u013332124\/article\/details\/89283727"},{"key":"e_1_3_2_1_6_1","unstructured":"2020. Remember a Super-Trillion-Scale Hadoop NameNode Performance Troubleshooting Process. Retrieved 2023-04-15 from https:\/\/blog.csdn.net\/weixin_44253169\/article\/details\/105564433"},{"key":"e_1_3_2_1_7_1","unstructured":"2020. Solve the Memory Leak Problem That NodeManager Frequently Triggers FULL-GC after Running for about Half a Year. Retrieved 2022-12-09 from https:\/\/blog.csdn.net\/weixin_43990680\/article\/details\/104754341"},{"key":"e_1_3_2_1_8_1","unstructured":"2020. Why We Switched from Fluent-Bit to Fluentd in 2 Hours. https:\/\/prometheuskube.com\/why-we-switched-from-fluent-bit-to-fluentd-in- 2-hours"},{"key":"e_1_3_2_1_9_1","unstructured":"2021. Analysis and Troubleshooting of a HDFS JournalNode Transaction Lag Problem. Retrieved 2022-12-09 from https:\/\/blog.csdn.net\/Androidlushangderen\/article\/details\/112744149"},{"key":"e_1_3_2_1_10_1","unstructured":"2021. Big Data Troubleshooting Series-HIVE Stepping on the Pit-HIVE-15642_org.Apache.Hadoop.Hive.thrift_Ming Ge's IT Essay Blog-CSDN Blog. Retrieved 2023-04-12 from https:\/\/blog.csdn.net\/ MichaelLi916\/article\/details\/119902075"},{"key":"e_1_3_2_1_11_1","unstructured":"2021. An HBase & HDFS Short-Circuit Read Odyssey. Retrieved 2024-10-12 from https:\/\/blogsarchive.apache.org\/hbase\/entry\/an-hbase-hdfs-short-circuit"},{"key":"e_1_3_2_1_12_1","unstructured":"2021. Record the Upgrading Process of Hadoop Cluster with Thousands of Nodes in the Database. Retrieved 2022-12-12 from https:\/\/zhuanlan.zhihu.com\/p\/163352048"},{"key":"e_1_3_2_1_13_1","unstructured":"2021. Record the Upgrading Process of Hadoop Cluster with Thousands of Nodes in the Database (Part 3). Retrieved 2023-04-16 from https:\/\/zhuanlan.zhihu.com\/p\/163352048"},{"key":"e_1_3_2_1_14_1","unstructured":"2021. Troubleshooting a Problem That HDFS Snapshot Cannot Be Deleted_hadoop Deletion Error Has a Snapshot. Retrieved 2023-04-17 from https:\/\/blog.csdn.net\/Androidlushangderen\/article\/details\/113446906"},{"key":"e_1_3_2_1_15_1","unstructured":"2021. When Kafka Went Offshore. https:\/\/blog.gojek.io\/when-kafka-went-offshore\/"},{"key":"e_1_3_2_1_16_1","unstructured":"2022. Apache Cassandra | Apache Cassandra Documentation. Retrieved 2022-12-13 from https:\/\/cassandra.apache.org\/_\/index.html"},{"key":"e_1_3_2_1_17_1","unstructured":"2022. Apache Flink: Stateful Computations over Data Streams. Retrieved 2022-12-13 from https:\/\/flink.apache.org\/"},{"key":"e_1_3_2_1_18_1","unstructured":"2022. Apache HBase - Apache HBase\u2122 Home. Retrieved 2022-12-13 from https:\/\/hbase.apache.org\/"},{"key":"e_1_3_2_1_19_1","unstructured":"2022. Apache Hive. Retrieved 2022-12-13 from https:\/\/hive.apache.org\/"},{"key":"e_1_3_2_1_20_1","unstructured":"2022. Apache Kafka. Retrieved 2022-12-13 from https:\/\/kafka.apache.org\/"},{"key":"e_1_3_2_1_21_1","unstructured":"2022. Google Cloud Computing Services. Retrieved 2022-12-13 from https:\/\/cloud.google.com\/"},{"key":"e_1_3_2_1_22_1","unstructured":"2022. Hadoop Distributed File System (HDFS). Retrieved 2022-12-13 from https:\/\/hadoop.apache.org\/docs\/stable\/hadoop-project-dist\/hadoop-hdfs\/HdfsDesign.html"},{"key":"e_1_3_2_1_23_1","unstructured":"2022. Kubernetes: Production-Grade Container Orchestration. Retrieved 2022-12-13 from https:\/\/kubernetes.io\/"},{"key":"e_1_3_2_1_24_1","unstructured":"2023. Hive on Mr Job Repeated Execution Problem Troubleshooting_hive Re-Read Execution-CSDN Blog. Retrieved 2023-11-28 from https:\/\/blog.csdn.net\/u013332124\/article\/details\/106575443"},{"key":"e_1_3_2_1_25_1","unstructured":"2023. Remember to Troubleshoot the ETCD OOM Problem Once. Retrieved 2023-11-29 from https:\/\/zhuanlan.zhihu.com\/p\/571473832"},{"key":"e_1_3_2_1_26_1","unstructured":"2023. Twists and Turns - Remember a K8S Cluster Application Troubleshooting - JD Cloud Developer Community. Retrieved 2023-11-30 from https:\/\/developer.jdcloud.com\/article\/1538"},{"key":"e_1_3_2_1_27_1","unstructured":"2024. Apache Spark\u2122 - Unified Engine for Large-Scale Data Analytics. Retrieved 2024-10-15 from https:\/\/spark.apache.org\/"},{"key":"e_1_3_2_1_28_1","unstructured":"2024. Etcd. Retrieved 2024-10-15 from https:\/\/etcd.io\/"},{"key":"e_1_3_2_1_29_1","unstructured":"2024. Gemini. Retrieved 2024-10-06 from https:\/\/gemini.google.com"},{"key":"e_1_3_2_1_30_1","unstructured":"2024. Gojek Super App. Retrieved 2024-10-15 from https:\/\/www.gojek.com\/en-id"},{"key":"e_1_3_2_1_31_1","unstructured":"2024. Hive Cannot Submit to Yarn_08235.15.1 Analysis of the Problem of Slow Hive Query Caused by Slow HDFS - CSDN Blog. Retrieved 2024-10-12 from https:\/\/blog.csdn.net\/weixin_42139302\/article\/details\/112092820"},{"key":"e_1_3_2_1_32_1","unstructured":"2024. MongoDB: The Developer Data Platform | MongoDB. Retrieved 2024-10-15 from https:\/\/www.mongodb.com\/"},{"key":"e_1_3_2_1_33_1","unstructured":"2024. Netflix\/Chaosmonkey. Netflix Inc.. https:\/\/github.com\/Netflix\/chaosmonkey"},{"key":"e_1_3_2_1_34_1","unstructured":"2024. PagerDuty | Real-Time Operations | Incident Response | On-Call. Retrieved 2024-10-15 from https:\/\/www.pagerduty.com\/"},{"key":"e_1_3_2_1_35_1","unstructured":"2024. Redis - The Real-time Data Platform. Retrieved 2024-10-15 from https:\/\/redis.io\/"},{"key":"e_1_3_2_1_36_1","unstructured":"2024. TiDB Powered by PingCAP. Retrieved 2024-10-15 from https:\/\/www.pingcap.com\/"},{"key":"e_1_3_2_1_37_1","unstructured":"2024. Zookeeper Once Fault Handling_exception Causing Close. Retrieved 2024-10-15 from https:\/\/blog.csdn.net\/huochen1994\/article\/details\/79288194"},{"key":"e_1_3_2_1_38_1","unstructured":"2024. ZooKeeper Overview. Retrieved 2024-10-16 from https:\/\/zookeeper.apache.org\/doc\/r3.1.2\/zookeeperOver.html"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/1165389.945454"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/1629575.1629594"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2723711"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/1453101.1453146"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3106237.3106255"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2408776.2408795"},{"key":"e_1_3_2_1_45_1","volume-title":"REPT: Reverse Debugging of Failures in Deployed Software. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Cui Weidong","year":"2018","unstructured":"Weidong Cui, Xinyang Ge, Baris Kasikci, Ben Niu, Upamanyu Sharma, Ruoyu Wang, and Insu Yun. 2018. REPT: Reverse Debugging of Failures in Deployed Software. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 17--32. https:\/\/www.usenix.org\/conference\/osdi18\/presentation\/weidong"},{"key":"e_1_3_2_1_46_1","unstructured":"Datadog. 16:32:45 -0400 -0400. Cloud Monitoring as a Service | Data-dog. Retrieved 2022-10-14 from https:\/\/www.datadoghq.com\/"},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/2371536.2371572"},{"key":"e_1_3_2_1_48_1","volume-title":"Insights and Tools for Root Cause Labelling of Incidents in Microsoft Azure. In 2023 USENIX Annual Technical Conference (USENIX ATC 23)","author":"Dogga Pradeep","year":"2023","unstructured":"Pradeep Dogga, Chetan Bansal, Richard Costleigh, Gopinath Jayagopal, Suman Nath, and Xuchao Zhang. 2023. AutoARTS: Taxonomy, Insights and Tools for Root Cause Labelling of Incidents in Microsoft Azure. In 2023 USENIX Annual Technical Conference (USENIX ATC 23). 359--372."},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/248448.248456"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/2491956.2462162"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389694"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3445814.3446700"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304004"},{"key":"e_1_3_2_1_54_1","first-page":"285","article-title":"Friday: Global Comprehension for Distributed Replay","volume":"7","author":"Geels Dennis","year":"2007","unstructured":"Dennis Geels, Gautam Altekar, Petros Maniatis, Timothy Roscoe, and Ion Stoica. 2007. Friday: Global Comprehension for Distributed Replay.. In NSDI, Vol. 7. 285--298.","journal-title":"NSDI"},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/3542929.3563482"},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/1629575.1629586"},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1177\/001872087401600308"},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/22627.22367"},{"key":"e_1_3_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/2670979.2670986"},{"key":"e_1_3_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3242086"},{"key":"e_1_3_2_1_61_1","unstructured":"Dominic Gunn. 2018. Kubernetes and the Menace ELB the Tale of an Outage. Retrieved 2023-04-13 from https:\/\/itnext.io\/kubernetes-and-the-menace-elb-the-tale-of-an-outage-c00bef678fc0"},{"key":"e_1_3_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3141235.3141236"},{"key":"e_1_3_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3102980.3103005"},{"key":"e_1_3_2_1_64_1","volume-title":"USENIX Annual Technical Conference","volume":"8","author":"Hunt Patrick","year":"2010","unstructured":"Patrick Hunt, Mahadev Konar, Flavio Paiva Junqueira, and Benjamin Reed. 2010. ZooKeeper: Wait-free Coordination for Internet-scale Systems.. In USENIX Annual Technical Conference, Vol. 8."},{"key":"e_1_3_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.14722\/ndss.2022.23057"},{"key":"e_1_3_2_1_66_1","volume-title":"Post Mortem: Kubernetes Node OOM. Retrieved 2022-12-11 from https:\/\/www.bluematador.com\/blog\/post-mortem-kubernetes-node-oom","author":"Jackson Keilan","year":"2019","unstructured":"Keilan Jackson. 2019. Post Mortem: Kubernetes Node OOM. Retrieved 2022-12-11 from https:\/\/www.bluematador.com\/blog\/post-mortem-kubernetes-node-oom"},{"key":"e_1_3_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/1808954.1808966"},{"key":"e_1_3_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/2254064.2254075"},{"key":"e_1_3_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132747.3132767"},{"key":"e_1_3_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/2815400.2815412"},{"key":"e_1_3_2_1_71_1","volume-title":"Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems. 517--530","author":"Leesatapornwongsa Tanakorn","unstructured":"Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, and Haryadi S. Gunawi. 2016. TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems. 517--530."},{"key":"e_1_3_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/2361276.2361301"},{"key":"e_1_3_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10664-021-10063-9"},{"key":"e_1_3_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/3190508.3190552"},{"key":"e_1_3_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE.2013.6606646"},{"key":"e_1_3_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1145\/1064978.1065014"},{"key":"e_1_3_2_1_77_1","unstructured":"Xuezheng Liu Zhenyu Guo Xi Wang Feibo Chen Xiaochen Lian Jian Tang Ming Wu M. Frans Kaashoek and Zheng Zhang. 2008. D3S: Debugging Deployed Distributed Systems. In NSDI. https:\/\/www.usenix.org\/event\/nsdi08\/tech\/full_papers\/liu_xuezheng\/liu_xuezheng_html\/"},{"key":"e_1_3_2_1_78_1","first-page":"559","article-title":"Understanding, Detecting and Localizing Partial Failures in Large System Software","volume":"20","author":"Lou Chang","year":"2020","unstructured":"Chang Lou, Peng Huang, and Scott Smith. 2020. Understanding, Detecting and Localizing Partial Failures in Large System Software.. In NSDI, Vol. 20. 559--574.","journal-title":"NSDI"},{"key":"e_1_3_2_1_79_1","volume-title":"Fourth Symposium on Operating Systems Design and Implementation (OSDI","author":"Lowell David E","year":"2000","unstructured":"David E Lowell, Subhachandra Chandra, and Peter Chen. 2000. Exploring failure transparency and the limits of generic recovery. In Fourth Symposium on Operating Systems Design and Implementation (OSDI 2000)."},{"key":"e_1_3_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1145\/2815400.2815415"},{"key":"e_1_3_2_1_81_1","unstructured":"David McGinnis. 2020. Debugging From The Field: The Case of the Empty Files. Retrieved 2022-12-11 from https:\/\/www.davidmcginnis.net\/post\/debugging-from-the-field-the-case-of-the-empty-files"},{"key":"e_1_3_2_1_82_1","unstructured":"Yash Mehrotra. 2020. The Case of the Missing Packet: An EKS Migration Tale. Retrieved 2024-06-06 from https:\/\/yashmehrotra.com\/posts\/the-case-of-the-missing-packet-an-eks-migration-tale\/"},{"key":"e_1_3_2_1_83_1","volume-title":"Engineering Record and Replay for Deployability. In 2017 USENIX Annual Technical Conference (USENIX ATC 17)","author":"O'Callahan Robert","year":"2017","unstructured":"Robert O'Callahan, Chris Jones, Nathan Froyd, Kyle Huey, Albert Noll, and Nimrod Partush. 2017. Engineering Record and Replay for Deployability. In 2017 USENIX Annual Technical Conference (USENIX ATC 17). 377--389. https:\/\/www.usenix.org\/conference\/atc17\/technical-sessions\/presentation\/ocallahan"},{"key":"e_1_3_2_1_84_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2016.2575829"},{"key":"e_1_3_2_1_85_1","unstructured":"Prometheus. 2022. Prometheus - Monitoring System & Time Series Database. Retrieved 2022-10-14 from https:\/\/prometheus.io\/"},{"key":"e_1_3_2_1_86_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASE56229.2023.00032"},{"key":"e_1_3_2_1_87_1","volume-title":"How Hadoop Clusters Break","author":"Rabkin Ariel","year":"2012","unstructured":"Ariel Rabkin and Randy Howard Katz. 2012. How Hadoop Clusters Break. IEEE software 30, 4 (2012), 88--94."},{"key":"e_1_3_2_1_88_1","unstructured":"Sid Rathi. 2021. Solving a Native Memory Leak. https:\/\/medium.com\/expedia-group-tech\/solving-a-native-memory-leak-71fe4b6f9463"},{"key":"e_1_3_2_1_89_1","volume-title":"Minimizing Faulty Executions of Distributed Systems. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16)","author":"Scott Colin","year":"2016","unstructured":"Colin Scott, Vjekoslav Brajkovic, George Necula, Arvind Krishnamurthy, and Scott Shenker. 2016. Minimizing Faulty Executions of Distributed Systems. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). 291--309."},{"key":"e_1_3_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSREW.2014.36"},{"key":"e_1_3_2_1_91_1","volume-title":"Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag.","author":"Sigelman Benjamin H.","year":"2010","unstructured":"Benjamin H. Sigelman, Luiz Andr\u00e9 Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 2010. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. (2010)."},{"key":"e_1_3_2_1_92_1","doi-asserted-by":"publisher","DOI":"10.1145\/3501297"},{"key":"e_1_3_2_1_93_1","doi-asserted-by":"publisher","DOI":"10.1145\/3501297"},{"key":"e_1_3_2_1_94_1","unstructured":"Hou Song. 2016. The Problem of Hbase Fast Reconnecting DataNode - Discovery and Analysis. Retrieved 2022-12-10 from http:\/\/housong.github.io\/2016\/hbase-reconnect-dn\/"},{"key":"e_1_3_2_1_95_1","doi-asserted-by":"publisher","DOI":"10.1145\/2660193.2660234"},{"key":"e_1_3_2_1_96_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCS.2012.65"},{"key":"e_1_3_2_1_97_1","doi-asserted-by":"publisher","DOI":"10.1145\/1294261.1294275"},{"key":"e_1_3_2_1_98_1","doi-asserted-by":"publisher","DOI":"10.1145\/2523616.2523633"},{"key":"e_1_3_2_1_99_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0020-7373(85)80054-7"},{"key":"e_1_3_2_1_100_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCS.2018.00074"},{"key":"e_1_3_2_1_101_1","doi-asserted-by":"publisher","DOI":"10.1145\/3552326.3587444"},{"key":"e_1_3_2_1_102_1","doi-asserted-by":"publisher","DOI":"10.1145\/1629575.1629587"},{"key":"e_1_3_2_1_103_1","volume-title":"Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm.","author":"Yuan Ding","year":"2014","unstructured":"Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm. 2014. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 249--265."},{"key":"e_1_3_2_1_104_1","doi-asserted-by":"publisher","DOI":"10.1145\/1736020.1736038"},{"key":"e_1_3_2_1_105_1","unstructured":"YuQing. 2024. Happyfish100\/Fastdfs. https:\/\/github.com\/happyfish100\/fastdfs"},{"key":"e_1_3_2_1_106_1","volume-title":"Presented as Part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). 15--28.","author":"Zaharia Matei","unstructured":"Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for in-Memory Cluster Computing. In Presented as Part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). 15--28."},{"key":"e_1_3_2_1_107_1","doi-asserted-by":"publisher","DOI":"10.1145\/1134285.1134324"},{"key":"e_1_3_2_1_108_1","doi-asserted-by":"publisher","DOI":"10.1145\/1250734.1250782"},{"key":"e_1_3_2_1_109_1","doi-asserted-by":"publisher","DOI":"10.1145\/3459637.3481903"},{"key":"e_1_3_2_1_110_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132747.3132768"},{"key":"e_1_3_2_1_111_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359650"},{"key":"e_1_3_2_1_112_1","doi-asserted-by":"publisher","DOI":"10.1145\/3477132.3483577"},{"key":"e_1_3_2_1_113_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132747.3132778"},{"key":"e_1_3_2_1_114_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2018.2887384"},{"key":"e_1_3_2_1_115_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSC.2019.2919823"}],"event":{"name":"SoCC '24: ACM Symposium on Cloud Computing","location":"Redmond WA USA","acronym":"SoCC '24","sponsor":["SIGMOD ACM Special Interest Group on Management of Data","SIGOPS ACM Special Interest Group on Operating Systems"]},"container-title":["Proceedings of the ACM Symposium on Cloud Computing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3698038.3698568","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3698038.3698568","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T18:59:10Z","timestamp":1755889150000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3698038.3698568"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,20]]},"references-count":115,"alternative-id":["10.1145\/3698038.3698568","10.1145\/3698038"],"URL":"https:\/\/doi.org\/10.1145\/3698038.3698568","relation":{},"subject":[],"published":{"date-parts":[[2024,11,20]]},"assertion":[{"value":"2024-11-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}