{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T20:59:05Z","timestamp":1775854745754,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":89,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,10,23]],"date-time":"2023-10-23T00:00:00Z","timestamp":1698019200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"NSF (National Science Foundation)","doi-asserted-by":"publisher","award":["CNS-2214272"],"award-info":[{"award-number":["CNS-2214272"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"NSF (National Science Foundation)","doi-asserted-by":"publisher","award":["CNS-1815525"],"award-info":[{"award-number":["CNS-1815525"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,10,23]]},"DOI":"10.1145\/3600006.3613145","type":"proceedings-article","created":{"date-parts":[[2023,10,3]],"date-time":"2023-10-03T14:44:17Z","timestamp":1696344257000},"page":"364-381","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":58,"title":["GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1614-0601","authenticated-orcid":false,"given":"Zhuang","family":"Wang","sequence":"first","affiliation":[{"name":"Department of Computer Science, Rice University, Houston, Texas, United States of 
America"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3543-2324","authenticated-orcid":false,"given":"Zhen","family":"Jia","sequence":"additional","affiliation":[{"name":"Amazon Web Services, Santa Clara, California, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3093-6486","authenticated-orcid":false,"given":"Shuai","family":"Zheng","sequence":"additional","affiliation":[{"name":"Amazon Web Services, Santa Clara, California, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0164-0849","authenticated-orcid":false,"given":"Zhen","family":"Zhang","sequence":"additional","affiliation":[{"name":"Amazon Web Services, Santa Clara, California, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-7822-5450","authenticated-orcid":false,"given":"Xinwei","family":"Fu","sequence":"additional","affiliation":[{"name":"Amazon Web Services, Santa Clara, California, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2954-0767","authenticated-orcid":false,"given":"T. S. Eugene","family":"Ng","sequence":"additional","affiliation":[{"name":"Rice University, Houston, Texas, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8165-840X","authenticated-orcid":false,"given":"Yida","family":"Wang","sequence":"additional","affiliation":[{"name":"Amazon Web Services, Santa Clara, California, United States of America"}]}],"member":"320","published-online":{"date-parts":[[2023,10,23]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"https:\/\/developer.nvidia.com\/NCCL","author":"NVIDIA","year":"2021","unstructured":"NVIDIA NCCL. https:\/\/developer.nvidia.com\/NCCL , 2021 . NVIDIA NCCL. https:\/\/developer.nvidia.com\/NCCL, 2021."},{"key":"e_1_3_2_1_2_1","volume-title":"https:\/\/aws.amazon.com\/machine-learning\/trainium\/","author":"Trainium AWS","year":"2022","unstructured":"AWS Trainium . https:\/\/aws.amazon.com\/machine-learning\/trainium\/ , 2022 . AWS Trainium. 
https:\/\/aws.amazon.com\/machine-learning\/trainium\/, 2022."},{"key":"e_1_3_2_1_3_1","volume-title":"https:\/\/github.com\/bigscience-workshop\/bigscience\/blob\/master\/train\/tr11-176B-ml\/chronicles.md","author":"Chronicles BLOOM","year":"2022","unstructured":"BLOOM Chronicles . https:\/\/github.com\/bigscience-workshop\/bigscience\/blob\/master\/train\/tr11-176B-ml\/chronicles.md , 2022 . BLOOM Chronicles. https:\/\/github.com\/bigscience-workshop\/bigscience\/blob\/master\/train\/tr11-176B-ml\/chronicles.md, 2022."},{"key":"e_1_3_2_1_4_1","volume-title":"https:\/\/docs.aws.amazon.com\/autoscaling\/","author":"Auto Scaling","year":"2023","unstructured":"Auto Scaling in AWS. https:\/\/docs.aws.amazon.com\/autoscaling\/ , 2023 . Auto Scaling in AWS. https:\/\/docs.aws.amazon.com\/autoscaling\/, 2023."},{"key":"e_1_3_2_1_5_1","volume-title":"https:\/\/learn.microsoft.com\/en-us\/azure\/app-service\/manage-scale-up","author":"Azure Auto Scaling","year":"2023","unstructured":"Auto Scaling in Azure . https:\/\/learn.microsoft.com\/en-us\/azure\/app-service\/manage-scale-up , 2023 . Auto Scaling in Azure. https:\/\/learn.microsoft.com\/en-us\/azure\/app-service\/manage-scale-up, 2023."},{"key":"e_1_3_2_1_6_1","volume-title":"https:\/\/cloud.google.com\/compute\/docs\/autoscaler","author":"Google Cloud Auto Scaling","year":"2023","unstructured":"Auto Scaling in Google Cloud . https:\/\/cloud.google.com\/compute\/docs\/autoscaler , 2023 . Auto Scaling in Google Cloud. https:\/\/cloud.google.com\/compute\/docs\/autoscaler, 2023."},{"key":"e_1_3_2_1_7_1","volume-title":"https:\/\/aws.amazon.com\/fsx\/","author":"Sx AWS","year":"2023","unstructured":"AWS F Sx . https:\/\/aws.amazon.com\/fsx\/ , 2023 . AWS FSx. https:\/\/aws.amazon.com\/fsx\/, 2023."},{"key":"e_1_3_2_1_8_1","volume-title":"https:\/\/openai.com\/blog\/chatgpt","author":"GPT.","year":"2023","unstructured":"Chat GPT. https:\/\/openai.com\/blog\/chatgpt , 2023 . ChatGPT. 
https:\/\/openai.com\/blog\/chatgpt, 2023."},{"key":"e_1_3_2_1_9_1","volume-title":"https:\/\/etcd.io\/","year":"2023","unstructured":"etcd. https:\/\/etcd.io\/ , 2023 . etcd. https:\/\/etcd.io\/, 2023."},{"key":"e_1_3_2_1_10_1","volume-title":"https:\/\/cloud.google.com\/compute\/docs\/gpus","author":"Google Cloud Platform GPU","year":"2023","unstructured":"GPU instances in Google Cloud Platform . https:\/\/cloud.google.com\/compute\/docs\/gpus , 2023 . GPU instances in Google Cloud Platform. https:\/\/cloud.google.com\/compute\/docs\/gpus, 2023."},{"key":"e_1_3_2_1_11_1","volume-title":"https:\/\/learn.microsoft.com\/en-us\/azure\/virtual-machines\/ndv2-series","year":"2023","unstructured":"ND40rs_v2 in Azure. https:\/\/learn.microsoft.com\/en-us\/azure\/virtual-machines\/ndv2-series , 2023 . ND40rs_v2 in Azure. https:\/\/learn.microsoft.com\/en-us\/azure\/virtual-machines\/ndv2-series, 2023."},{"key":"e_1_3_2_1_12_1","volume-title":"https:\/\/learn.microsoft.com\/en-us\/azure\/virtual-machines\/nda100-v4-series","year":"2023","unstructured":"ND96asr_v4 in Azure. https:\/\/learn.microsoft.com\/en-us\/azure\/virtual-machines\/nda100-v4-series , 2023 . ND96asr_v4 in Azure. https:\/\/learn.microsoft.com\/en-us\/azure\/virtual-machines\/nda100-v4-series, 2023."},{"key":"e_1_3_2_1_13_1","volume-title":"https:\/\/www.nvidia.com\/en-gb\/data-center\/dgx-a100\/","author":"NVIDIA DGX","year":"2023","unstructured":"NVIDIA DGX A100. https:\/\/www.nvidia.com\/en-gb\/data-center\/dgx-a100\/ , 2023 . NVIDIA DGX A100. https:\/\/www.nvidia.com\/en-gb\/data-center\/dgx-a100\/, 2023."},{"key":"e_1_3_2_1_14_1","volume-title":"https:\/\/github.com\/facebookresearch\/metaseq\/tree\/main\/projects\/OPT\/chronicles","author":"B","year":"2023","unstructured":"OPT-175 B logbook. https:\/\/github.com\/facebookresearch\/metaseq\/tree\/main\/projects\/OPT\/chronicles , 2023 . OPT-175B logbook. 
https:\/\/github.com\/facebookresearch\/metaseq\/tree\/main\/projects\/OPT\/chronicles, 2023."},{"key":"e_1_3_2_1_15_1","volume-title":"https:\/\/aws.amazon.com\/ec2\/instance-types\/p3\/","author":"AWS.","year":"2023","unstructured":"P3dn.24xlarge in AWS. https:\/\/aws.amazon.com\/ec2\/instance-types\/p3\/ , 2023 . P3dn.24xlarge in AWS. https:\/\/aws.amazon.com\/ec2\/instance-types\/p3\/, 2023."},{"key":"e_1_3_2_1_16_1","volume-title":"https:\/\/aws.amazon.com\/ec2\/instance-types\/p4\/","author":"AWS.","year":"2023","unstructured":"P4d.24xlarge in AWS. https:\/\/aws.amazon.com\/ec2\/instance-types\/p4\/ , 2023 . P4d.24xlarge in AWS. https:\/\/aws.amazon.com\/ec2\/instance-types\/p4\/, 2023."},{"key":"e_1_3_2_1_17_1","volume-title":"https:\/\/docs.aws.amazon.com\/sagemaker\/index.html","year":"2023","unstructured":"SageMaker. https:\/\/docs.aws.amazon.com\/sagemaker\/index.html , 2023 . SageMaker. https:\/\/docs.aws.amazon.com\/sagemaker\/index.html, 2023."},{"key":"e_1_3_2_1_18_1","first-page":"265","volume-title":"USENIX Symposium on Operating Systems Design and Implementation (OSDI)","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi , Paul Barham , Jianmin Chen , Zhifeng Chen , Andy Davis , Jeffrey Dean , Matthieu Devin , Sanjay Ghemawat , Geoffrey Irving , Michael Isard , Manjunath Kudlur , Josh Levenberg , Rajat Monga , Sherry Moore , Derek G. Murray , Benoit Steiner , Paul Tucker , Vijay Vasudevan , Pete Warden , Martin Wicke , Yuan Yu , and Xiaoqiang Zheng . Tensorflow : A system for large-scale machine learning . In USENIX Symposium on Operating Systems Design and Implementation (OSDI) , pages 265 -- 283 , 2016 . Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. 
Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265--283, 2016."},{"key":"e_1_3_2_1_19_1","volume-title":"NSDI","author":"Agarwal Sharad","year":"2010","unstructured":"Sharad Agarwal , John Dunagan , Navendu Jain , Stefan Saroiu , Alec Wolman , and Habinder Bhogan . Volley : Automated data placement for geo-distributed cloud services . In NSDI , 2010 . Sharad Agarwal, John Dunagan, Navendu Jain, Stefan Saroiu, Alec Wolman, and Habinder Bhogan. Volley: Automated data placement for geo-distributed cloud services. In NSDI, 2010."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3492321.3519584"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3236367.3236381"},{"key":"e_1_3_2_1_22_1","volume-title":"MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274","author":"Chen Tianqi","year":"2015","unstructured":"Tianqi Chen , Mu Li , Yutian Li , Min Lin , Naiyan Wang , Minjie Wang , Tianjun Xiao , Bing Xu , Chiyuan Zhang , and Zheng Zhang . MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 , 2015 . Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015."},{"key":"e_1_3_2_1_23_1","first-page":"1627","volume-title":"International Conference on Machine Learning","author":"Chen Yu","year":"2020","unstructured":"Yu Chen , Zhenming Liu , Bin Ren , and Xin Jin . On efficient constructions of checkpoints . In International Conference on Machine Learning , pages 1627 -- 1636 . PMLR, 2020 . 
Yu Chen, Zhenming Liu, Bin Ren, and Xin Jin. On efficient constructions of checkpoints. In International Conference on Machine Learning, pages 1627--1636. PMLR, 2020."},{"key":"e_1_3_2_1_24_1","volume-title":"Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311","author":"Chowdhery Aakanksha","year":"2022","unstructured":"Aakanksha Chowdhery , Sharan Narang , Jacob Devlin , Maarten Bosma , Gaurav Mishra , Adam Roberts , Paul Barham , Hyung Won Chung , Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 , 2022 . Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022."},{"key":"e_1_3_2_1_25_1","first-page":"37","volume-title":"2013 USENIX Annual Technical Conference (USENIX ATC 13)","author":"Cidon Asaf","year":"2013","unstructured":"Asaf Cidon , Stephen Rumble , Ryan Stutsman , Sachin Katti , John Ousterhout , and Mendel Rosenblum . Copysets : Reducing the frequency of data loss in cloud storage . In 2013 USENIX Annual Technical Conference (USENIX ATC 13) , pages 37 -- 48 , 2013 . Asaf Cidon, Stephen Rumble, Ryan Stutsman, Sachin Katti, John Ousterhout, and Mendel Rosenblum. Copysets: Reducing the frequency of data loss in cloud storage. In 2013 USENIX Annual Technical Conference (USENIX ATC 13), pages 37--48, 2013."},{"key":"e_1_3_2_1_26_1","volume-title":"BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 , 2018 . 
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018."},{"key":"e_1_3_2_1_27_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 , 2018 . Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018."},{"key":"e_1_3_2_1_28_1","first-page":"929","volume-title":"19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)","author":"Eisenman Assaf","year":"2022","unstructured":"Assaf Eisenman , Kiran Kumar Matam , Steven Ingram , Dheevatsa Mudigere , Raghuraman Krishnamoorthi , Krishnakumar Nair , Misha Smelyanskiy , and Murali Annavaram . Check-N-Run : a checkpointing system for training deep learning recommendation models . In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22) , pages 929 -- 943 , 2022 . Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, and Murali Annavaram. Check-N-Run: a checkpointing system for training deep learning recommendation models. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), pages 929--943, 2022."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2342356.2342360"},{"key":"e_1_3_2_1_30_1","first-page":"323","volume-title":"OSDI","author":"Geambasu Roxana","year":"2010","unstructured":"Roxana Geambasu , Amit A Levy , Tadayoshi Kohno , Arvind Krishnamurthy , and Henry M Levy . 
Comet : An active distributed key-value store . In OSDI , pages 323 -- 336 , 2010 . Roxana Geambasu, Amit A Levy, Tadayoshi Kohno, Arvind Krishnamurthy, and Henry M Levy. Comet: An active distributed key-value store. In OSDI, pages 323--336, 2010."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2018436.2018477"},{"key":"e_1_3_2_1_32_1","first-page":"418","article-title":"Accelerating distributed deep learning with communication scheduling","volume":"1","author":"Hashemi Sayed Hadi","year":"2019","unstructured":"Sayed Hadi Hashemi , Sangeetha Abdu Jyothi , and Roy Campbell . TicTac : Accelerating distributed deep learning with communication scheduling . Proceedings of Machine Learning and Systems , 1 : 418 -- 430 , 2019 . Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and Roy Campbell. TicTac: Accelerating distributed deep learning with communication scheduling. Proceedings of Machine Learning and Systems, 1:418--430, 2019.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_34_1","volume-title":"GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang , Youlong Cheng , Ankur Bapna , Orhan Firat , Dehao Chen , Mia Chen , HyoukJoong Lee , Jiquan Ngiam , Quoc V Le , Yonghui Wu , GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32 , 2019 . Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. 
Advances in neural information processing systems, 32, 2019."},{"key":"e_1_3_2_1_35_1","first-page":"132","article-title":"Priority-based parameter propagation for distributed DNN training","volume":"1","author":"Jayarajan Anand","year":"2019","unstructured":"Anand Jayarajan , Jinliang Wei , Garth Gibson , Alexandra Fedorova , and Gennady Pekhimenko . Priority-based parameter propagation for distributed DNN training . Proceedings of Machine Learning and Systems , 1 : 132 -- 145 , 2019 . Anand Jayarajan, Jinliang Wei, Garth Gibson, Alexandra Fedorova, and Gennady Pekhimenko. Priority-based parameter propagation for distributed DNN training. Proceedings of Machine Learning and Systems, 1:132--145, 2019.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_36_1","first-page":"947","volume-title":"USENIX Annual Technical Conference","author":"Jeon Myeongjae","year":"2019","unstructured":"Myeongjae Jeon , Shivaram Venkataraman , Amar Phanishayee , Junjie Qian , Wencong Xiao , and Fan Yang . Analysis of large-scale multitenant GPU clusters for DNN training workloads . In USENIX Annual Technical Conference , pages 947 -- 960 , 2019 . Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. Analysis of large-scale multitenant GPU clusters for DNN training workloads. In USENIX Annual Technical Conference, pages 947--960, 2019."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.5555\/3488766.3488792"},{"key":"e_1_3_2_1_38_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba . Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 , 2014 . Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 
arXiv preprint arXiv:1412.6980, 2014."},{"key":"e_1_3_2_1_39_1","volume-title":"Reducing activation recomputation in large transformer models. arXiv preprint arXiv:2205.05198","author":"Korthikanti Vijay","year":"2022","unstructured":"Vijay Korthikanti , Jared Casper , Sangkug Lym , Lawrence McAfee , Michael Andersch , Mohammad Shoeybi , and Bryan Catanzaro . Reducing activation recomputation in large transformer models. arXiv preprint arXiv:2205.05198 , 2022 . Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. arXiv preprint arXiv:2205.05198, 2022."},{"key":"e_1_3_2_1_40_1","first-page":"51","volume-title":"December 2001)","author":"Lamport Leslie","year":"2001","unstructured":"Leslie Lamport . Paxos made simple. ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121 , December 2001) , pages 51 -- 58 , 2001 . Leslie Lamport. Paxos made simple. ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001), pages 51--58, 2001."},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2019.2928289"},{"key":"e_1_3_2_1_42_1","volume-title":"Proceedings of the VLDB Endowment, 13(12)","author":"Li Shen","unstructured":"Shen Li , Yanli Zhao , Rohan Varma , Omkar Salpekar , Pieter Noordhuis , Teng Li , Adam Paszke , Jeff Smith , Brian Vaughan , Pritam Damania , Pytorch distributed : Experiences on accelerating data parallel training . Proceedings of the VLDB Endowment, 13(12) . Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. Pytorch distributed: Experiences on accelerating data parallel training. Proceedings of the VLDB Endowment, 13(12)."},{"key":"e_1_3_2_1_43_1","volume-title":"Roberta: A robustly optimized BERT pretraining approach. 
arXiv preprint arXiv:1907.11692","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu , Myle Ott , Naman Goyal , Jingfei Du , Mandar Joshi , Danqi Chen , Omer Levy , Mike Lewis , Luke Zettlemoyer , and Veselin Stoyanov . Roberta: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 , 2019 . Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019."},{"key":"e_1_3_2_1_44_1","first-page":"637","article-title":"Understanding and improving failure tolerant training for deep learning recommendation with partial recovery","volume":"3","author":"Maeng Kiwan","year":"2021","unstructured":"Kiwan Maeng , Shivam Bharuka , Isabel Gao , Mark Jeffrey , Vikram Saraph , Bor-Yiing Su , Caroline Trippel , Jiyan Yang , Mike Rabbat , Brandon Lucia , Understanding and improving failure tolerant training for deep learning recommendation with partial recovery . Proceedings of Machine Learning and Systems , 3 : 637 -- 651 , 2021 . Kiwan Maeng, Shivam Bharuka, Isabel Gao, Mark Jeffrey, Vikram Saraph, Bor-Yiing Su, Caroline Trippel, Jiyan Yang, Mike Rabbat, Brandon Lucia, et al. Understanding and improving failure tolerant training for deep learning recommendation with partial recovery. Proceedings of Machine Learning and Systems, 3:637--651, 2021.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_45_1","first-page":"937","volume-title":"Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation","author":"Mai Luo","year":"2020","unstructured":"Luo Mai , Guo Li , Marcel Wagenl\u00e4nder , Konstantinos Fertakis , Andrei-Octavian Brabete , and Peter Pietzuch . Kungfu : Making training in distributed machine learning adaptive . 
In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation , pages 937 -- 954 , 2020 . Luo Mai, Guo Li, Marcel Wagenl\u00e4nder, Konstantinos Fertakis, Andrei-Octavian Brabete, and Peter Pietzuch. Kungfu: Making training in distributed machine learning adaptive. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, pages 937--954, 2020."},{"key":"e_1_3_2_1_46_1","volume-title":"Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843","author":"Merity Stephen","year":"2016","unstructured":"Stephen Merity , Caiming Xiong , James Bradbury , and Richard Socher . Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843 , 2016 . Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016."},{"key":"e_1_3_2_1_47_1","first-page":"1928","volume-title":"International conference on machine learning","author":"Mnih Volodymyr","year":"2016","unstructured":"Volodymyr Mnih , Adria Puigdomenech Badia , Mehdi Mirza , Alex Graves , Timothy Lillicrap , Tim Harley , David Silver , and Koray Kavukcuoglu . Asynchronous methods for deep reinforcement learning . In International conference on machine learning , pages 1928 -- 1937 . PMLR, 2016 . Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928--1937. PMLR, 2016."},{"key":"e_1_3_2_1_48_1","first-page":"203","volume-title":"FAST","volume":"21","author":"Mohan Jayashree","year":"2021","unstructured":"Jayashree Mohan , Amar Phanishayee , and Vijay Chidambaram . Check-freq : Frequent, fine-grained DNN checkpointing . In FAST , volume 21 , pages 203 -- 216 , 2021 . Jayashree Mohan, Amar Phanishayee, and Vijay Chidambaram. Check-freq: Frequent, fine-grained DNN checkpointing. 
In FAST, volume 21, pages 203--216, 2021."},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/CCGrid49817.2020.00-76"},{"key":"e_1_3_2_1_52_1","first-page":"305","volume-title":"2014 USENIX Annual Technical Conference","author":"Ongaro Diego","year":"2014","unstructured":"Diego Ongaro and John Ousterhout . In search of an understandable consensus algorithm . In 2014 USENIX Annual Technical Conference , pages 305 -- 319 , 2014 . Diego Ongaro and John Ousterhout. In search of an understandable consensus algorithm. In 2014 USENIX Annual Technical Conference, pages 305--319, 2014."},{"key":"e_1_3_2_1_53_1","volume-title":"GPT-4 technical report. arXiv preprint arXiv:2303.08774","author":"AI.","year":"2023","unstructured":"Open AI. GPT-4 technical report. arXiv preprint arXiv:2303.08774 , 2023 . OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023."},{"key":"e_1_3_2_1_54_1","first-page":"400","article-title":"Resource elasticity in distributed deep learning","volume":"2","author":"Or Andrew","year":"2020","unstructured":"Andrew Or , Haoyu Zhang , and Michael Freedman . Resource elasticity in distributed deep learning . Proceedings of Machine Learning and Systems , 2 : 400 -- 411 , 2020 . Andrew Or, Haoyu Zhang, and Michael Freedman. Resource elasticity in distributed deep learning. 
Proceedings of Machine Learning and Systems, 2:400--411, 2020.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00045"},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2009.191"},{"key":"e_1_3_2_1_57_1","first-page":"8024","volume-title":"Advances in Neural Information Processing Systems","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke , Sam Gross , Francisco Massa , Adam Lerer , James Bradbury , Gregory Chanan , Trevor Killeen , Zeming Lin , Natalia Gimelshein , Luca Antiga , Alban Desmaison , Andreas Kopf , Edward Yang , Zachary DeVito , Martin Raison , Alykhan Tejani , Sasank Chilamkurthy , Benoit Steiner , Lu Fang , Junjie Bai , and Soumith Chintala . PyTorch : An imperative style, high-performance deep learning library . In Advances in Neural Information Processing Systems , pages 8024 -- 8035 , 2019 . Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. 
In Advances in Neural Information Processing Systems, pages 8024--8035, 2019."},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359642"},{"key":"e_1_3_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/71.730527"},{"key":"e_1_3_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2013.17"},{"issue":"8","key":"e_1_3_2_1_61_1","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford , Jeffrey Wu , Rewon Child , David Luan , Dario Amodei , Ilya Sutskever , Language models are unsupervised multitask learners . OpenAI blog , 1 ( 8 ): 9 , 2019 . Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.","journal-title":"OpenAI blog"},{"key":"e_1_3_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_3_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_3_2_1_64_1","first-page":"185","volume-title":"GPU Technology Conference","author":"Rossetti Davide","year":"2015","unstructured":"Davide Rossetti and S Team . GPUDirect : Integrating the GPU with a network interface . In GPU Technology Conference , page 185 , 2015 . Davide Rossetti and S Team. GPUDirect: Integrating the GPU with a network interface. In GPU Technology Conference, page 185, 2015."},{"key":"e_1_3_2_1_65_1","volume-title":"Fran\u00e7ois Yvon, Matthias Gall\u00e9, et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100","author":"Scao Teven Le","year":"2022","unstructured":"Teven Le Scao , Angela Fan , Christopher Akiki , Ellie Pavlick , Suzana Ili\u0107 , Daniel Hesslow , Roman Castagn\u00e9 , Alexandra Sasha Luccioni , Fran\u00e7ois Yvon, Matthias Gall\u00e9, et al. BLOOM: A 176B-parameter open-access multilingual language model. 
arXiv preprint arXiv:2211.05100 , 2022 . Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ili\u0107, Daniel Hesslow, Roman Castagn\u00e9, Alexandra Sasha Luccioni, Fran\u00e7ois Yvon, Matthias Gall\u00e9, et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022."},{"key":"e_1_3_2_1_66_1","volume-title":"Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799","author":"Sergeev Alexander","year":"2018","unstructured":"Alexander Sergeev and Mike Del Balso . Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799 , 2018 . Alexander Sergeev and Mike Del Balso. Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799, 2018."},{"key":"e_1_3_2_1_67_1","volume-title":"Megatron-LM: Training multibillion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi , Mostofa Patwary , Raul Puri , Patrick LeGresley , Jared Casper , and Bryan Catanzaro . Megatron-LM: Training multibillion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 , 2019 . Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multibillion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019."},{"key":"e_1_3_2_1_68_1","volume-title":"Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. 
arXiv preprint arXiv:2201.11990","author":"Smith Shaden","year":"2022","unstructured":"Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022."},{"key":"e_1_3_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_3_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN48987.2021.00043"},{"key":"e_1_3_2_1_71_1","first-page":"599","volume-title":"NSDI","author":"Tan Cheng","year":"2019","unstructured":"Cheng Tan, Ze Jin, Chuanxiong Guo, Tianrong Zhang, Haitao Wu, Karl Deng, Dongming Bi, and Dong Xiang. NetBouncer: Active device and link failure localization in data center networks. In NSDI, pages 599--614, 2019."},{"issue":"1","key":"e_1_3_2_1_72_1","article-title":"Optimization of collective communication operations in MPICH","volume":"19","author":"Thakur Rajeev","year":"2005","unstructured":"Rajeev Thakur, Rolf Rabenseifner, and William Gropp. Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications, 19(1), 2005.","journal-title":"The International Journal of High Performance Computing Applications"},{"key":"e_1_3_2_1_73_1","volume-title":"NSDI","author":"Thorpe John","year":"2023","unstructured":"John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, Ravi Netravali, and Guoqing Harry Xu. Bamboo: Making preemptible instances resilient for affordable training of large DNNs. In NSDI, 2023."},{"key":"e_1_3_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/2807591.2807666"},{"key":"e_1_3_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056044"},{"key":"e_1_3_2_1_76_1","first-page":"5998","volume-title":"Advances in neural information processing systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998--6008, 2017."},{"key":"e_1_3_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3552326.3567505"},{"key":"e_1_3_2_1_78_1","first-page":"5","article-title":"A compression scheduler for scalable communication-efficient distributed training","author":"Wang Zhuang","year":"2023","unstructured":"Zhuang Wang, Xinyu Wu, Zhaozhuo Xu, and T. S. Eugene Ng. 
Cupcake: A compression scheduler for scalable communication-efficient distributed training. Proceedings of Machine Learning and Systems, 5, 2023.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_79_1","first-page":"23274","volume-title":"International Conference on Machine Learning","author":"Wang Zhuang","year":"2022","unstructured":"Zhuang Wang, Zhaozhuo Xu, Xinyu Wu, Anshumali Shrivastava, and T. S. Eugene Ng. DRAGONN: Distributed randomized approximate gradients of neural networks. In International Conference on Machine Learning, pages 23274--23291. PMLR, 2022."},{"key":"e_1_3_2_1_80_1","first-page":"122","volume-title":"Proceedings of the 2006 ACM\/IEEE conference on Supercomputing","author":"Weil Sage A","unstructured":"Sage A Weil, Scott A Brandt, Ethan L Miller, and Carlos Maltzahn. CRUSH: Controlled, scalable, decentralized placement of replicated data. In Proceedings of the 2006 ACM\/IEEE conference on Supercomputing, pages 122--es, 2006."},{"key":"e_1_3_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3064966"},{"key":"e_1_3_2_1_82_1","first-page":"595","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Xiao Wencong","year":"2018","unstructured":"Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, et al. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 595--610, 2018."},{"key":"e_1_3_2_1_83_1","first-page":"269","article-title":"PipeMare: Asynchronous pipeline parallel DNN training","volume":"3","author":"Yang Bowen","year":"2021","unstructured":"Bowen Yang, Jian Zhang, Jonathan Li, Christopher R\u00e9, Christopher Aberger, and Christopher De Sa. PipeMare: Asynchronous pipeline parallel DNN training. Proceedings of Machine Learning and Systems, 3:269--296, 2021.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_84_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2013.6638950"},{"key":"e_1_3_2_1_85_1","volume-title":"Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. 
arXiv preprint arXiv:2205.01068","author":"Zhang Susan","year":"2022","unstructured":"Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022."},{"key":"e_1_3_2_1_86_1","doi-asserted-by":"publisher","DOI":"10.1145\/3405671.3405810"},{"key":"e_1_3_2_1_87_1","doi-asserted-by":"publisher","DOI":"10.14778\/3561261.3561265"},{"key":"e_1_3_2_1_88_1","first-page":"93","volume-title":"an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In 2004 ieee international conference on cluster computing (ieee cat. no. 04EX935)","author":"Zheng Gengbin","year":"2004","unstructured":"Gengbin Zheng, Lixia Shi, and Laxmikant V Kal\u00e9. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No. 04EX935), pages 93--103. IEEE, 2004."},{"key":"e_1_3_2_1_89_1","first-page":"559","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. 
In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 559--578, Carlsbad, CA, July 2022. USENIX Association."}],"event":{"name":"SOSP '23: 29th Symposium on Operating Systems Principles","location":"Koblenz Germany","acronym":"SOSP '23","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems","USENIX"]},"container-title":["Proceedings of the 29th Symposium on Operating Systems Principles"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3600006.3613145","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/3600006.3613145","content-type":"text\/html","content-version":"vor","intended-application":"syndication"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:36:49Z","timestamp":1750178209000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3600006.3613145"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,23]]},"references-count":89,"alternative-id":["10.1145\/3600006.3613145","10.1145\/3600006"],"URL":"https:\/\/doi.org\/10.1145\/3600006.3613145","relation":{},"subject":[],"published":{"date-parts":[[2023,10,23]]},"assertion":[{"value":"2023-10-23","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}