{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,23]],"date-time":"2026-01-23T08:16:07Z","timestamp":1769156167336,"version":"3.49.0"},"reference-count":120,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2022,4,28]],"date-time":"2022-04-28T00:00:00Z","timestamp":1651104000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NSF","award":["CCF-1717630\/1853714, CCF-1910747, and CNS-1943204"],"award-info":[{"award-number":["CCF-1717630\/1853714, CCF-1910747, and CNS-1943204"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Storage"],"published-print":{"date-parts":[[2022,5,31]]},"abstract":"<jats:p>Large-scale parallel file systems (PFSs) play an essential role in high-performance computing (HPC). However, despite their importance, their reliability is much less studied or understood compared with that of local storage systems or cloud storage systems. Recent failure incidents at real HPC centers have exposed the latent defects in PFS clusters as well as the urgent need for a systematic analysis.<\/jats:p>\n          <jats:p>\n            To address the challenge, we perform a study of the failure recovery and logging mechanisms of PFSs in this article. First, to trigger the failure recovery and logging operations of the target PFS, we introduce a black-box fault injection tool called \u00a0\n            <jats:sc>PFault<\/jats:sc>\n            , which is transparent to PFSs and easy to deploy in practice. 
\u00a0\n            <jats:sc>PFault<\/jats:sc>\n            emulates the failure state of individual storage nodes in the PFS based on a set of pre-defined fault models and enables examining the PFS behavior under fault systematically.\n          <\/jats:p>\n          <jats:p>\n            Next, we apply\n            <jats:sc>PFault<\/jats:sc>\n            to study two widely used PFSs: Lustre and BeeGFS. Our analysis reveals the unique failure recovery and logging patterns of the target PFSs and identifies multiple cases where the PFSs are imperfect in terms of failure handling. For example, Lustre includes a recovery component called LFSCK to detect and fix PFS-level inconsistencies, but we find that LFSCK itself may hang or trigger kernel panics when scanning a corrupted Lustre. Even after the recovery attempt of LFSCK, the subsequent workloads applied to Lustre may still behave abnormally (e.g., hang or report I\/O errors). Similar issues have also been observed in BeeGFS and its recovery component BeeGFS-FSCK. We analyze the root causes of the abnormal symptoms observed in depth, which has led to a new patch set to be merged into the coming Lustre release. In addition, we characterize the extensive logs generated in the experiments in detail and identify the unique patterns and limitations of PFSs in terms of failure logging. 
We hope this study and the resulting tool and dataset can facilitate follow-up research in the communities and help improve PFSs for reliable high-performance computing.\n          <\/jats:p>","DOI":"10.1145\/3483447","type":"journal-article","created":{"date-parts":[[2022,3,29]],"date-time":"2022-03-29T11:38:40Z","timestamp":1648553920000},"page":"1-44","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":13,"title":["A Study of Failure Recovery and Logging of High-Performance Parallel File Systems"],"prefix":"10.1145","volume":"18","author":[{"given":"Runzhou","family":"Han","sequence":"first","affiliation":[{"name":"Iowa State University, Ames, Iowa"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Om Rameshwar","family":"Gatla","sequence":"additional","affiliation":[{"name":"Iowa State University, Ames, Iowa"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0741-3436","authenticated-orcid":false,"given":"Mai","family":"Zheng","sequence":"additional","affiliation":[{"name":"Iowa State University, Ames, Iowa"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jinrui","family":"Cao","sequence":"additional","affiliation":[{"name":"State University of New York at Plattsburgh, Plattsburgh, New York"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Di","family":"Zhang","sequence":"additional","affiliation":[{"name":"North Carolina University at Charlotte, Charlotte, North Carolina"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4078-8149","authenticated-orcid":false,"given":"Dong","family":"Dai","sequence":"additional","affiliation":[{"name":"North Carolina University at Charlotte, Charlotte, North Carolina"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yong","family":"Chen","sequence":"additional","affiliation":[{"name":"Texas Tech University, Lubbock, 
Texas"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jonathan","family":"Cook","sequence":"additional","affiliation":[{"name":"New Mexico State University, Las Cruces, New Mexico"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,4,28]]},"reference":[{"key":"e_1_3_3_2_2","unstructured":"Lustre File System. http:\/\/lustre.org\/."},{"key":"e_1_3_3_3_2","unstructured":"BeeGFS File System. https:\/\/www.beegfs.io\/."},{"key":"e_1_3_3_4_2","unstructured":"The OrangeFS Project. 2017. http:\/\/www.orangefs.org\/."},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.5555\/2591272.2591300"},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/2820615"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/2815400.2815422"},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359662"},{"key":"e_1_3_3_9_2","volume-title":"Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201914)","author":"Pillai Thanumalayan Sankaranarayana","year":"2014","unstructured":"Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2014. All file systems are not created equal: On the complexity of crafting crash-consistent applications. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201914)."},{"key":"e_1_3_3_10_2","first-page":"399","volume-title":"11th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201914)","author":"Leesatapornwongsa Tanakorn","year":"2014","unstructured":"Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi, Jeffrey F. Lukman, and Haryadi S. Gunawi. 2014. SAMC: Semantic-aware model checking for fast discovery of deep bugs in cloud systems. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201914). 
USENIX Association, 399\u2013414. https:\/\/www.usenix.org\/conference\/osdi14\/technical-sessions\/presentation\/leesatapornwongsa"},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359645"},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/2048066.2048082"},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/2524211.2524217"},{"key":"e_1_3_3_14_2","unstructured":"Hadoop Distributed File System. 2006-now. https:\/\/hadoop.apache.org\/docs\/r1.2.1\/hdfs_design.html."},{"key":"e_1_3_3_15_2","unstructured":"Apache Cassandra. 2008-now. https:\/\/cassandra.apache.org."},{"key":"e_1_3_3_16_2","unstructured":"Apache Zookeeper. Retrieved January 2021 https:\/\/zookeeper.apache.org."},{"key":"e_1_3_3_17_2","unstructured":"High Performance Computing Center Texas Tech University. 2017. http:\/\/www.depts.ttu.edu\/hpcc\/."},{"key":"e_1_3_3_18_2","unstructured":"Power Outage Event at High Performance Computing Center (HPCC) in Texas. 2016. https:\/\/www.ece.iastate.edu\/mai\/docs\/failures\/2016-hpcc-lustre.pdf."},{"key":"e_1_3_3_19_2","unstructured":"GPFS Failures at Ohio Supercomputer Center (OSC). 2016. https:\/\/www.ece.iastate.edu\/mai\/docs\/failures\/2016-hpcc-lustre.pdf."},{"key":"e_1_3_3_20_2","unstructured":"Multiple Switch Outages at Ohio Supercomputer Center (OSC). 2016. https:\/\/www.ece.iastate.edu\/mai\/docs\/failures\/2016-hpcc-lustre.pdf."},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3242086"},{"key":"e_1_3_3_22_2","unstructured":"Aishwarya Ganesan Ramnatthan Alagappan Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau. 2017. Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to single errors and corruptions. In 15th USENIX Conference on File and Storage Technologies (FAST\u201917) . 149\u2013166."},{"key":"e_1_3_3_23_2","unstructured":"Haryadi S. Gunawi Thanh Do Pallavi Joshi Peter Alvaro Joseph M. Hellerstein Andrea C. 
Arpaci-Dusseau Remzi H. Arpaci-Dusseau Koushik Sen and Dhruba Borthakur. 2011. FATE and DESTINI: A framework for cloud recovery testing. In Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201911) ."},{"key":"e_1_3_3_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3241062"},{"key":"e_1_3_3_25_2","unstructured":"Open MPI. 2004-now. https:\/\/www.open-mpi.org."},{"key":"e_1_3_3_26_2","unstructured":"Lustre Software Release 2.x: Operations Manual. 2017. http:\/\/lustre.org\/documentation\/."},{"key":"e_1_3_3_27_2","first-page":"29","volume-title":"ACM SIGOPS Operating Systems Review","author":"Ghemawat Sanjay","year":"2003","unstructured":"Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google file system. In ACM SIGOPS Operating Systems Review, Vol. 37. ACM, 29\u201343."},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","DOI":"10.5555\/2750482.2750499"},{"key":"e_1_3_3_29_2","doi-asserted-by":"publisher","DOI":"10.5555\/983238"},{"key":"e_1_3_3_30_2","unstructured":"Linux SCSI target framework (tgt). 2017. http:\/\/stgt.sourceforge.net\/."},{"key":"e_1_3_3_31_2","unstructured":"NVM Express over Fabrics Specification Released. 2017. http:\/\/www.nvmexpress.org\/nvm-express-over-fabrics-specification-released\/."},{"key":"e_1_3_3_32_2","unstructured":"LFSCK: An online file system checker for Lustre. 2017. https:\/\/github.com\/Xyratex\/lustre-stable\/blob\/master\/Documentation\/lfsck.txt."},{"key":"e_1_3_3_33_2","unstructured":"Apache log4j a logging library for Java. 2001-now. http:\/\/logging.apache.org\/log4j\/2.x\/."},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3205289.3205302"},{"key":"e_1_3_3_35_2","unstructured":"Lustre Patch: LU-13980 osd: remove osd_object_release LASSERT. 2020. https:\/\/review.whamcloud.com\/#\/c\/40058\/."},{"key":"e_1_3_3_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/50202.50214"},{"key":"e_1_3_3_37_2","unstructured":"HPC User Site Census. 
2016. http:\/\/www.intersect360.com\/."},{"key":"e_1_3_3_38_2","unstructured":"Top500 Supercomputers. 2019. https:\/\/www.top500.org\/lists\/2016\/11\/."},{"key":"e_1_3_3_39_2","unstructured":"Apache HBase. 2020. https:\/\/hbase.apache.org."},{"key":"e_1_3_3_40_2","unstructured":"BeeGFS Documentation v7.2. 2020. https:\/\/doc.beegfs.io\/latest\/overview\/overview.html."},{"key":"e_1_3_3_41_2","unstructured":"SQLite documents. 2017. http:\/\/www.sqlite.org\/docs.html."},{"key":"e_1_3_3_42_2","volume-title":"Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201908)","author":"Gunawi Haryadi S.","year":"2008","unstructured":"Haryadi S. Gunawi, Abhishek Rajimwale, Andrea C. Arpaci-dusseau, and Remzi H. Arpaci-dusseau. 2008. SQCK: A declarative file system checker. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201908)."},{"key":"e_1_3_3_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDS.1996.540200"},{"key":"e_1_3_3_44_2","doi-asserted-by":"crossref","first-page":"204","DOI":"10.1109\/IPDS.1995.395831","volume-title":"Proceedings of 1995 IEEE International Computer Performance and Dependability Symposium","author":"Han Seungjae","year":"1995","unstructured":"Seungjae Han, K. G. Shin, and H. A. Rosenberg. 1995. DOCTOR: An integrated software fault injection environment for distributed real-time systems. In Proceedings of 1995 IEEE International Computer Performance and Dependability Symposium. 204\u2013213."},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDS.2000.839467"},{"key":"e_1_3_3_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/12.54853"},{"key":"e_1_3_3_47_2","unstructured":"Jepsen. 
https:\/\/github.com\/jepsen-io\/jepsen."},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378484"},{"key":"e_1_3_3_49_2","first-page":"359","volume-title":"17th USENIX Conference on File and Storage Technologies (FAST\u201919)","author":"Stuardo Cesar A.","year":"2019","unstructured":"Cesar A. Stuardo, Tanakorn Leesatapornwongsa, Riza O. Suminto, Huan Ke, Jeffrey F. Lukman, Wei-Chiu Chuang, Shan Lu, and Haryadi S. Gunawi. 2019. ScaleCheck: A single-machine approach for discovering scalability bugs in large distributed systems. In 17th USENIX Conference on File and Storage Technologies (FAST\u201919). 359\u2013373. https:\/\/www.usenix.org\/conference\/fast19\/presentation\/stuardo."},{"key":"e_1_3_3_50_2","unstructured":"Apache Hadoop YARN. 2020. https:\/\/hadoop.apache.org\/docs\/current\/hadoop-yarn\/hadoop-yarn-site\/YARN.html."},{"key":"e_1_3_3_51_2","unstructured":"Apache Hadoop. 2019. https:\/\/hadoop.apache.org\/docs\/stable\/."},{"key":"e_1_3_3_52_2","unstructured":"E2fsprogs: Ext2\/3\/4 Filesystems Utilities. 2017. http:\/\/e2fsprogs.sourceforge.net"},{"key":"e_1_3_3_53_2","volume-title":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST\u201919)","author":"Dai Dong","year":"2019","unstructured":"Dong Dai, Om Rameshwar Gatla, and Mai Zheng. 2019. A performance study of lustre file system checker: Bottlenecks and potentials. In 2019 35th Symposium on Mass Storage Systems and Technologies (MSST\u201919)."},{"key":"e_1_3_3_54_2","unstructured":"FUSE. Linux FUSE (Filesystem in Userspace) interface. https:\/\/github.com\/libfuse\/libfuse."},{"key":"e_1_3_3_55_2","unstructured":"R. Sandberg D. Goldberg S. Kleiman D. Walsh and B. Lyon. 1988. Design and implementation of the Sun network filesystem. In Innovations in Internetworking . Artech House Inc. Norwood MA 379\u2013390. 
http:\/\/dl.acm.org\/citation.cfm?id=59309.59338."},{"key":"e_1_3_3_56_2","article-title":"An introduction to fibre channel","author":"Primmer Meryem","year":"1996","unstructured":"Meryem Primmer. 1996. An introduction to fibre channel. HP Journal (1996).","journal-title":"HP Journal"},{"key":"e_1_3_3_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/1416944.1416947"},{"key":"e_1_3_3_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/1254882.1254917"},{"key":"e_1_3_3_59_2","doi-asserted-by":"publisher","DOI":"10.1145\/1966445.1966477"},{"key":"e_1_3_3_60_2","volume-title":"Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST\u201907)","author":"Schroeder Bianca","year":"2007","unstructured":"Bianca Schroeder and Garth A. Gibson. 2007. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST\u201907)."},{"key":"e_1_3_3_61_2","first-page":"509","volume-title":"ICDE","author":"Subramanian Sriram","year":"2010","unstructured":"Sriram Subramanian, Yupu Zhang, Rajiv Vaidyanathan, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Jeffrey F. Naughton. 2010. Impact of disk corruption on open-source DBMS. In ICDE. 509\u2013520."},{"key":"e_1_3_3_62_2","first-page":"449","volume-title":"11th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201914)","author":"Zheng Mai","year":"2014","unstructured":"Mai Zheng, Joseph Tucek, Dachuan Huang, Feng Qin, Mark Lillibridge, Elizabeth S. Yang, Bill W. Zhao, and Shashank Singh. 2014. Torturing databases for fun and profit. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201914). USENIX Association, 449\u2013464. https:\/\/www.usenix.org\/conference\/osdi14\/technical-sessions\/presentation\/zheng_mai."},{"key":"e_1_3_3_63_2","unstructured":"Network Partition. 2017. 
https:\/\/www.cs.cornell.edu\/courses\/cs614\/2003sp\/papers\/DGS85.pdf."},{"key":"e_1_3_3_64_2","unstructured":"e2fsck(8) \u2014 Linux manual page. 2017. https:\/\/man7.org\/linux\/man-pages\/man8\/e2fsck.8.html."},{"key":"e_1_3_3_65_2","doi-asserted-by":"publisher","DOI":"10.1145\/1095810.1095830"},{"key":"e_1_3_3_66_2","unstructured":"debugfs. 2017. http:\/\/man7.org\/linux\/man-pages\/man8\/debugfs.8.html."},{"key":"e_1_3_3_67_2","doi-asserted-by":"publisher","DOI":"10.5555\/3291168.3291173"},{"key":"e_1_3_3_68_2","doi-asserted-by":"publisher","DOI":"10.1145\/2043164.2018477"},{"key":"e_1_3_3_69_2","first-page":"203","volume-title":"ACM SIGMETRICS Performance Evaluation Review","author":"Smith Keith A.","year":"1997","unstructured":"Keith A. Smith and Margo I. Seltzer. 1997. File system aging-increasing the relevance of file system benchmarks. In ACM SIGMETRICS Performance Evaluation Review, Vol. 25. ACM, 203\u2013213."},{"key":"e_1_3_3_70_2","doi-asserted-by":"publisher","DOI":"10.5555\/3129633.3129639"},{"key":"e_1_3_3_71_2","unstructured":"CloudLab. http:\/\/cloudlab.us\/."},{"key":"e_1_3_3_72_2","unstructured":"Montage: An Astronomical Image Mosaic Engine. 2017. http:\/\/montage.ipac.caltech.edu\/."},{"key":"e_1_3_3_73_2","unstructured":"Wikipedia:Database download. 2017. https:\/\/en.wikipedia.org\/wiki\/Wikipedia:Database_download."},{"key":"e_1_3_3_74_2","doi-asserted-by":"publisher","DOI":"10.5555\/3019046.3019055"},{"key":"e_1_3_3_75_2","doi-asserted-by":"publisher","DOI":"10.1145\/2168836.2168861"},{"key":"e_1_3_3_76_2","unstructured":"Simple logging facade for Java. 2019. http:\/\/www.slf4j.org."},{"key":"e_1_3_3_77_2","doi-asserted-by":"crossref","unstructured":"Mai Zheng Joseph Tucek Feng Qin Mark Lillibridge Bill W. Zhao and Elizabeth S. Yang. 2016. Reliability analysis of SSDs under power fault. ACM Trans. Comput. Syst. 
34 4 (2016).","DOI":"10.1145\/2992782"},{"key":"e_1_3_3_78_2","doi-asserted-by":"publisher","DOI":"10.1145\/3281031"},{"key":"e_1_3_3_79_2","unstructured":"AspectJ. 2001-now. https:\/\/www.eclipse.org\/aspectj\/."},{"key":"e_1_3_3_80_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2019.2946563"},{"key":"e_1_3_3_81_2","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2019.00035"},{"key":"e_1_3_3_82_2","volume-title":"2020 IEEE Symposium on Security and Privacy (SP\u201920)","author":"Xu Meng","year":"2020","unstructured":"Meng Xu, Sanidhya Kashyap, Hanqing Zhao, and Taesoo Kim. 2020. Krace: Data race fuzzing for kernel file systems. In 2020 IEEE Symposium on Security and Privacy (SP\u201920)."},{"key":"e_1_3_3_83_2","volume-title":"2019 IEEE Symposium on Security and Privacy (SP\u201919)","author":"Jeong Dae R.","year":"2019","unstructured":"Dae R. Jeong, Kyungtae Kim, Basavesh Shivakumar, Byoungyoung Lee, and Insik Shin. 2019. Razzer: Finding kernel race bugs through fuzzing. In 2019 IEEE Symposium on Security and Privacy (SP\u201919)."},{"key":"e_1_3_3_84_2","unstructured":"Colin Scott. 2015. Fuzzing raft for fun and publication. https:\/\/colin-scott.github.io\/blog\/2015\/10\/07\/fuzzing-raft-for-fun-and-profit\/."},{"key":"e_1_3_3_85_2","unstructured":"socket(2) \u2014 Linux manual page. 2020. 
https:\/\/man7.org\/linux\/man-pages\/man2\/socket.2.html."},{"key":"e_1_3_3_86_2","doi-asserted-by":"publisher","DOI":"10.1109\/DSNW.2011.5958811"},{"key":"e_1_3_3_87_2","doi-asserted-by":"publisher","DOI":"10.1145\/96267.96279"},{"key":"e_1_3_3_88_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICST46399.2020.00062"},{"key":"e_1_3_3_89_2","doi-asserted-by":"publisher","DOI":"10.1109\/PDSW51947.2020.00013"},{"key":"e_1_3_3_90_2","doi-asserted-by":"publisher","DOI":"10.1145\/2110356.2110360"},{"key":"e_1_3_3_91_2","doi-asserted-by":"publisher","DOI":"10.1145\/3133956.3134015"},{"key":"e_1_3_3_92_2","doi-asserted-by":"publisher","DOI":"10.1145\/3465332.3470873"},{"key":"e_1_3_3_93_2","doi-asserted-by":"publisher","DOI":"10.1145\/1629575.1629587"},{"key":"e_1_3_3_94_2","unstructured":"GitLab repository for PFault by Data Storage Lab@ISU. 2020. https:\/\/git.ece.iastate.edu\/data-storage-lab\/prototypes\/pfault."},{"key":"e_1_3_3_95_2","doi-asserted-by":"publisher","DOI":"10.1109\/MASCOTS.2013.19"},{"key":"e_1_3_3_96_2","first-page":"123","volume-title":"ACM SIGPLAN Notices","author":"Vetter Jeffrey S.","year":"2001","unstructured":"Jeffrey S. Vetter and Michael O. McCracken. 2001. Statistical scalability analysis of communication operations in distributed applications. In ACM SIGPLAN Notices, Vol. 36. ACM, 123\u2013132."},{"key":"e_1_3_3_97_2","unstructured":"HPC-5 Open Source Software project LANL-Trace. 2015. institutes.lanl.gov\/data\/tdata\/."},{"key":"e_1_3_3_98_2","doi-asserted-by":"publisher","DOI":"10.1145\/1374596.1374609"},{"key":"e_1_3_3_99_2","unstructured":"Michael P. Mesnier Matthew Wachs Raja R. Simbasivan Julio Lopez James Hendricks Gregory R. Ganger and David R. O\u2019Hallaron. 2007. \/\/Trace: Parallel trace replay with approximate causal events. USENIX."},{"key":"e_1_3_3_100_2","first-page":"1","volume-title":"IEEE International Symposium on Parallel and Distributed Processing, 2008 (IPDPS\u201908).","year":"2008","unstructured":"S. Seelam, I. 
Chung, D.-Y. Hong, H.-F. Wen, and H. Yu. 2008. Early experiences in application level I\/O tracing on blue gene systems. In IEEE International Symposium on Parallel and Distributed Processing, 2008 (IPDPS\u201908). IEEE, 1\u20138."},{"key":"e_1_3_3_101_2","unstructured":"Darshan:HPC I\/O Characterization Tool. 2017. http:\/\/www.mcs.anl.gov\/research\/projects\/darshan\/."},{"key":"e_1_3_3_102_2","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTR.2009.5289150"},{"key":"e_1_3_3_103_2","volume-title":"Proceedings of the 12th ACM Workshop on Hot Topics in Storage and File Systems (HotStorage\u201920)","author":"Sun Jinghan","year":"2020","unstructured":"Jinghan Sun, Chen Wang, Jian Huang, and Marc Snir. 2020. Understanding and finding crash-consistency bugs in parallel file systems. In Proceedings of the 12th ACM Workshop on Hot Topics in Storage and File Systems (HotStorage\u201920)."},{"key":"e_1_3_3_104_2","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2723711"},{"key":"e_1_3_3_105_2","unstructured":"Java Virtual Machine Tool Interface (JVM TI). https:\/\/docs.oracle.com\/javase\/8\/docs\/technotes\/guides\/jvmti\/."},{"key":"e_1_3_3_106_2","unstructured":"Trail: The Reflection API. https:\/\/docs.oracle.com\/javase\/tutorial\/reflect\/index.html."},{"key":"e_1_3_3_107_2","unstructured":"WALA home page. 2015. http:\/\/wala.sourceforge.net\/wiki\/index.php\/."},{"key":"e_1_3_3_108_2","unstructured":"Java bytecode engineering toolkit. 1999. https:\/\/www.javassist.org\/."},{"key":"e_1_3_3_109_2","unstructured":"The LLVM Compiler Infrastructure. 2020. https:\/\/llvm.org."},{"key":"e_1_3_3_110_2","first-page":"961","volume-title":"2019 USENIX Annual Technical Conference (USENIX ATC\u201919)","author":"Xu Erci","year":"2019","unstructured":"Erci Xu, Mai Zheng, Feng Qin, Yikang Xu, and Jiesheng Wu. 2019. Lessons and actions: What we learned from 10K SSD-related storage system failures. In 2019 USENIX Annual Technical Conference (USENIX ATC\u201919). 
USENIX Association, 961\u2013976. https:\/\/www.usenix.org\/conference\/atc19\/presentation\/xu."},{"key":"e_1_3_3_111_2","volume-title":"2018 IEEE\/ACM 3rd International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS\u201918)","author":"Xu Erci","year":"2018","unstructured":"Erci Xu, Mai Zheng, Feng Qin, Jiesheng Wu, and Yikang Xu. 2018. Understanding SSD reliability in large-scale cloud systems. In 2018 IEEE\/ACM 3rd International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS\u201918)."},{"key":"e_1_3_3_112_2","doi-asserted-by":"publisher","DOI":"10.1145\/2987550.2987583"},{"key":"e_1_3_3_113_2","first-page":"67","volume-title":"14th USENIX Conference on File and Storage Technologies (FAST\u201916)","author":"Lagisetty Arif Merchant Bianca Schroeder and Raghav","year":"2016","unstructured":"Arif Merchant Bianca Schroeder and Raghav Lagisetty. 2016. Flash reliability in production: The expected and the unexpected. In 14th USENIX Conference on File and Storage Technologies (FAST\u201916). USENIX Association, 67\u201380. https:\/\/www.usenix.org\/conference\/fast16\/technical-sessions\/presentation\/schroeder."},{"key":"e_1_3_3_114_2","doi-asserted-by":"publisher","DOI":"10.5555\/2591272.2591300"},{"key":"e_1_3_3_115_2","doi-asserted-by":"publisher","DOI":"10.1145\/3456727.3463783"},{"key":"e_1_3_3_116_2","first-page":"131","volume-title":"Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI\u201906)","author":"Yang Junfeng","year":"2006","unstructured":"Junfeng Yang, Can Sar, and Dawson Engler. 2006. EXPLODE: A lightweight, general system for finding serious storage system errors. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI\u201906). 
131\u2013146."},{"key":"e_1_3_3_117_2","doi-asserted-by":"publisher","DOI":"10.5555\/1855807.1855814"},{"key":"e_1_3_3_118_2","first-page":"31","volume-title":"Presented as Part of the 11th USENIX Conference on File and Storage Technologies (FAST\u201913)","author":"Lu Lanyue","year":"2013","unstructured":"Lanyue Lu, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Shan Lu. 2013. A study of Linux file system evolution. In Presented as Part of the 11th USENIX Conference on File and Storage Technologies (FAST\u201913). USENIX, 31\u201344. https:\/\/www.usenix.org\/conference\/fast13\/technical-sessions\/presentation\/lu."},{"key":"e_1_3_3_119_2","doi-asserted-by":"publisher","DOI":"10.1145\/3281031"},{"key":"e_1_3_3_120_2","doi-asserted-by":"publisher","DOI":"10.5555\/3154601.3154608"},{"key":"e_1_3_3_121_2","doi-asserted-by":"publisher","DOI":"10.1145\/2815400.2815402"}],"container-title":["ACM Transactions on Storage"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3483447","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3483447","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3483447","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:18:58Z","timestamp":1750191538000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3483447"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,4,28]]},"references-count":120,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,5,31]]}},"alternative-id":["10.1145\/3483447"],"URL":"https:\/\/doi.org\/10.1145\/3483447","relation":{},"ISSN":["1553-3077","1553-3093"],"issn-type":[{"value":"1553-3077","type":"print"},{"value":"1553-3093","ty
pe":"electronic"}],"subject":[],"published":{"date-parts":[[2022,4,28]]},"assertion":[{"value":"2021-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-04-28","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}