{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T09:40:59Z","timestamp":1775122859512,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":55,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,10,23]],"date-time":"2023-10-23T00:00:00Z","timestamp":1698019200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"NSF (National Science Foundation)","doi-asserted-by":"publisher","award":["1909067"],"award-info":[{"award-number":["1909067"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"NSF (National Science Foundation)","doi-asserted-by":"publisher","award":["2104243"],"award-info":[{"award-number":["2104243"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"NSF (National Science Foundation)","doi-asserted-by":"publisher","award":["2106184"],"award-info":[{"award-number":["2106184"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,10,23]]},"DOI":"10.1145\/3600006.3613152","type":"proceedings-article","created":{"date-parts":[[2023,10,3]],"date-time":"2023-10-03T14:44:17Z","timestamp":1696344257000},"page":"382-395","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":24,"title":["Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-5206-2333","authenticated-orcid":false,"given":"Insu","family":"Jang","sequence":"first","affiliation":[{"name":"University of Michigan, Ann 
Arbor, MI, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0813-5911","authenticated-orcid":false,"given":"Zhenning","family":"Yang","sequence":"additional","affiliation":[{"name":"University of Michigan, Ann Arbor, MI, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0164-0849","authenticated-orcid":false,"given":"Zhen","family":"Zhang","sequence":"additional","affiliation":[{"name":"Amazon Web Services, Santa Clara, CA, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8741-5847","authenticated-orcid":false,"given":"Xin","family":"Jin","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0884-6740","authenticated-orcid":false,"given":"Mosharaf","family":"Chowdhury","sequence":"additional","affiliation":[{"name":"University of Michigan, Ann Arbor, MI, United States of America"}]}],"member":"320","published-online":{"date-parts":[[2023,10,23]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3492321.3519584"},{"key":"e_1_3_2_2_2_1","unstructured":"Romain Beaumont. 2022. Large Scale OpenCLIP: L\/14 H\/14 AND G\/14 Trained on LAION-2B. https:\/\/laion.ai\/blog\/large-openclip\/"},{"key":"e_1_3_2_2_3_1","unstructured":"Stas Bekman. 2022. The Technology Behind Bloom Training. https:\/\/huggingface.co\/blog\/bloom-megatron-deepspeed"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.38.8.716"},{"key":"e_1_3_2_2_5_1","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems (NeurIPS). https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2020\/hash\/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html"},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-68928-5"},{"key":"e_1_3_2_2_7_1","unstructured":"Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174 [cs.LG]"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS49936.2021.00035"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1423"},{"key":"e_1_3_2_2_10_1","volume-title":"USENIX Symposium on Networked Systems Design and Implementation (NSDI). https:\/\/www.usenix.org\/conference\/nsdi22\/presentation\/eisenman","author":"Eisenman Assaf","year":"2022","unstructured":"Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, and Murali Annavaram. 2022. Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models. In USENIX Symposium on Networked Systems Design and Implementation (NSDI). https:\/\/www.usenix.org\/conference\/nsdi22\/presentation\/eisenman"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3437801.3441593"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126937"},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/s12532-011-0026-8"},{"key":"e_1_3_2_2_14_1","unstructured":"Horovod. 2019. Elastic Horovod. https:\/\/horovod.readthedocs.io\/en\/stable\/elastic_include.html"},{"key":"e_1_3_2_2_15_1","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, and Zhifeng Chen. 2019. 
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Advances in Neural Information Processing Systems (NeurIPS). https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2019\/hash\/093f65e080a295f8076b1c5722a46aa2-Abstract.html"},{"key":"e_1_3_2_2_16_1","volume-title":"Elastic Resource Sharing for Distributed Deep Learning. In USENIX Symposium on Networked Systems Design and Implementation (NSDI). https:\/\/www.usenix.org\/conference\/nsdi21\/presentation\/hwang","author":"Hwang Changho","year":"2021","unstructured":"Changho Hwang, Taehyun Kim, Sunghyun Kim, Jinwoo Shin, and KyoungSoo Park. 2021. Elastic Resource Sharing for Distributed Deep Learning. In USENIX Symposium on Networked Systems Design and Implementation (NSDI). https:\/\/www.usenix.org\/conference\/nsdi21\/presentation\/hwang"},{"key":"e_1_3_2_2_17_1","volume-title":"Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads. In USENIX Annual Technical Conference (ATC). https:\/\/www.usenix.org\/conference\/atc19\/presentation\/jeon","author":"Jeon Myeongjae","year":"2019","unstructured":"Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads. 
In USENIX Annual Technical Conference (ATC). https:\/\/www.usenix.org\/conference\/atc19\/presentation\/jeon"},{"key":"e_1_3_2_2_18_1","volume-title":"USENIX Annual Technical Conference (ATC). https:\/\/www.usenix.org\/conference\/atc22\/presentation\/jia-xianyan","author":"Jia Xianyan","year":"2022","unstructured":"Xianyan Jia, Le Jiang, Ang Wang, Wencong Xiao, Ziji Shi, Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, Xiaoyong Liu, and Wei Lin. 2022. Whale: Efficient Giant Model Training over Heterogeneous GPUs. In USENIX Annual Technical Conference (ATC). https:\/\/www.usenix.org\/conference\/atc22\/presentation\/jia-xianyan"},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2023.3247001"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC55918.2022.00029"},{"key":"e_1_3_2_2_21_1","volume-title":"Aryl: An Elastic Cluster Scheduler for Deep Learning. arXiv:2202.07896 [cs.DC]","author":"Li Jiamin","year":"2022","unstructured":"Jiamin Li, Hong Xu, Yibo Zhu, Zherui Liu, Chuanxiong Guo, and Cong Wang. 2022. Aryl: An Elastic Cluster Scheduler for Deep Learning. arXiv:2202.07896 [cs.DC]"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476145"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415530"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3534678.3539070"},{"key":"e_1_3_2_2_25_1","volume-title":"Pointer Sentinel Mixture Models. In International Conference on Learning Representations (ICLR). https:\/\/openreview.net\/forum?id=Byj72udxe","author":"Merity Stephen","year":"2017","unstructured":"Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer Sentinel Mixture Models. In International Conference on Learning Representations (ICLR). https:\/\/openreview.net\/forum?id=Byj72udxe"},{"key":"e_1_3_2_2_26_1","unstructured":"Microsoft. 2022. Varuna. https:\/\/github.com\/microsoft\/varuna"},{"key":"e_1_3_2_2_27_1","unstructured":"MinIO. 2023. MinIO: High Performance Object Storage for AI. https:\/\/github.com\/minio\/minio"},{"key":"e_1_3_2_2_28_1","volume-title":"Fine-Grained DNN Checkpointing. In USENIX Conference on File and Storage Technologies (FAST). https:\/\/www.usenix.org\/conference\/fast21\/presentation\/mohan","author":"Mohan Jayashree","year":"2021","unstructured":"Jayashree Mohan, Amar Phanishayee, and Vijay Chidambaram. 2021. CheckFreq: Frequent, Fine-Grained DNN Checkpointing. In USENIX Conference on File and Storage Technologies (FAST). https:\/\/www.usenix.org\/conference\/fast21\/presentation\/mohan"},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_3_2_2_31_1","unstructured":"Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. 2019. Deep Learning Recommendation Model for Personalization and Recommendation Systems. arXiv:1906.00091 [cs.IR]"},{"key":"e_1_3_2_2_33_1","volume-title":"USENIX Annual Technical Conference (ATC). https:\/\/www.usenix.org\/conference\/atc20\/presentation\/park","author":"Park Jay H.","year":"2020","unstructured":"Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi, Sam H. Noh, and Young ri Choi. 2020. Het-Pipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism. In USENIX Annual Technical Conference (ATC). https:\/\/www.usenix.org\/conference\/atc20\/presentation\/park"},{"key":"e_1_3_2_2_34_1","unstructured":"PyTorch. 2020. Torch Elastic. https:\/\/pytorch.org\/elastic\/latest\/"},{"key":"e_1_3_2_2_35_1","volume-title":"Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, and Eric P. Xing.","author":"Qiao Aurick","year":"2021","unstructured":"Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, and Eric P. Xing. 2021. Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI). https:\/\/www.usenix.org\/conference\/osdi21\/presentation\/qiao"},{"key":"e_1_3_2_2_36_1","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language Models are Unsupervised Multitask Learners. https:\/\/d4mucfpksywv.cloudfront.net\/better-language-models\/language-models.pdf"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476205"},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1093\/acprof:oso\/9780198568209.001.0001"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_3_2_2_41_1","volume-title":"Proceedings of Machine Learning and Systems (MLSys). https:\/\/proceedings.mlsys.org\/paper_files\/paper\/2022\/hash\/7c98f9c7ab2df90911da23f9ce72ed6e-Abstract.html","author":"Reed James","year":"2022","unstructured":"James Reed, Zachary DeVito, Horace He, Ansley Ussery, and Jason Ansel. 2022. torch.fx: Practical Program Capture and Transformation for Deep Learning in Python. In Proceedings of Machine Learning and Systems (MLSys). https:\/\/proceedings.mlsys.org\/paper_files\/paper\/2022\/hash\/7c98f9c7ab2df90911da23f9ce72ed6e-Abstract.html"},{"key":"e_1_3_2_2_42_1","volume-title":"ZeRO-Offload: Democratizing Billion-Scale Model Training. In USENIX Annual Technical Conference (ATC). https:\/\/www.usenix.org\/conference\/atc21\/presentation\/ren-jie","author":"Ren Jie","year":"2021","unstructured":"Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. In USENIX Annual Technical Conference (ATC). 
https:\/\/www.usenix.org\/conference\/atc21\/presentation\/ren-jie"},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.2307\/2032755"},{"key":"e_1_3_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2009.4"},{"key":"e_1_3_2_2_45_1","volume-title":"Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro.","author":"Smith Shaden","year":"2022","unstructured":"Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. 2022. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model. arXiv:2201.11990 [cs.CL]"},{"key":"e_1_3_2_2_46_1","unstructured":"UCLA System. 2023. Bamboo. https:\/\/github.com\/uclasystem\/bamboo"},{"key":"e_1_3_2_2_47_1","unstructured":"Jakub M Tarnawski, Deepak Narayanan, and Amar Phanishayee. 2021. 
Piper: Multidimensional Planner for DNN Parallelization. In Advances in Neural Information Processing Systems (NeurIPS). https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2021\/hash\/d01eeca8b24321cd2fe89dd85b9beb51-Abstract.html"},{"key":"e_1_3_2_2_48_1","unstructured":"John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, Ravi Netravali, and Guoqing Harry Xu. 2023. Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs. In USENIX Symposium on Networked Systems Design and Implementation (NSDI). https:\/\/www.usenix.org\/conference\/nsdi23\/presentation\/thorpe"},{"key":"e_1_3_2_2_49_1","unstructured":"Pablo Villalobos, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, Anson Ho, and Marius Hobbhahn. 2022. Machine Learning Model Sizes and the Parameter Gap. arXiv:2207.02852 [cs.LG]"},{"key":"e_1_3_2_2_50_1","unstructured":"Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. 2022. MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters. In USENIX Symposium on Networked Systems Design and Implementation (NSDI). 
https:\/\/www.usenix.org\/conference\/nsdi22\/presentation\/weng"},{"key":"e_1_3_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_3_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCS47774.2020.00018"},{"key":"e_1_3_2_2_53_1","volume-title":"Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer.","author":"Zhang Susan","year":"2022","unstructured":"Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open Pre-trained Transformer Language Models. arXiv:2205.01068 [cs.CL]"},{"key":"e_1_3_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611479.3611514"},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"crossref","unstructured":"Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, and Shen Li. 2023. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. 
arXiv:2304.11277 [cs.DC]","DOI":"10.14778\/3611540.3611569"},{"key":"e_1_3_2_2_56_1","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. 2022. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI). https:\/\/www.usenix.org\/conference\/osdi22\/presentation\/zheng-lianmin"}],"event":{"name":"SOSP '23: 29th Symposium on Operating Systems Principles","location":"Koblenz Germany","acronym":"SOSP '23","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems","USENIX"]},"container-title":["Proceedings of the 29th Symposium on Operating Systems Principles"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3600006.3613152","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/3600006.3613152","content-type":"text\/html","content-version":"vor","intended-application":"syndication"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:36:49Z","timestamp":1750178209000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3600006.3613152"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,23]]},"references-count":55,"alternative-id":["10.1145\/3600006.3613152","10.1145\/3600006"],"URL":"https:\/\/doi.org\/10.1145\/3600006.3613152","relation":{},"subject":[],"published":{"date-parts":[[2023,10,23]]},"assertion":[{"value":"2023-10-23","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}