{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T15:34:40Z","timestamp":1772724880464,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":66,"publisher":"ACM","funder":[{"DOI":"10.13039\/501100000038","name":"Natural Sciences and Engineering Research Council of Canada","doi-asserted-by":"publisher","award":["569162-2022"],"award-info":[{"award-number":["569162-2022"]}],"id":[{"id":"10.13039\/501100000038","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,6,8]]},"DOI":"10.1145\/3721145.3730418","type":"proceedings-article","created":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T12:57:17Z","timestamp":1755867437000},"page":"368-383","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-0469-2050","authenticated-orcid":false,"given":"Runsheng Benson","family":"Guo","sequence":"first","affiliation":[{"name":"Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-7462-1694","authenticated-orcid":false,"given":"Utkarsh","family":"Anand","sequence":"additional","affiliation":[{"name":"Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-4415-3525","authenticated-orcid":false,"given":"Arthur","family":"Chen","sequence":"additional","affiliation":[{"name":"Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-7036-0397","authenticated-orcid":false,"given":"Khuzaima","family":"Daudjee","sequence":"additional","affiliation":[{"name":"Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,8,22]]},"reference":[{"key":"e_1_3_3_1_2_2","unstructured":"Marah Abdin Sam\u00a0Ade Jacobs Ammar\u00a0Ahmad Awan Jyoti Aneja Ahmed Awadallah Hany Awadalla Nguyen Bach Amit Bahree Arash Bakhtiari Harkirat Behl et\u00a0al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2404.14219 (2024)."},{"key":"e_1_3_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/HiPC62374.2024.00019"},{"key":"e_1_3_3_1_4_2","unstructured":"Amazon Web Services Inc.2024. AWS Cloud Computing. https:\/\/aws.amazon.com\/ Accessed: 2024-02-06."},{"key":"e_1_3_3_1_5_2","volume-title":"PyTorch on ROCm","year":"2025","unstructured":"AMD. 2025. PyTorch on ROCm. https:\/\/rocm.docs.amd.com\/projects\/install-on-linux\/en\/latest\/install\/3rd-party\/pytorch-install.html Accessed: 2025-02-24."},{"key":"e_1_3_3_1_6_2","series-title":"(NIPS \u201920)","volume-title":"Proceedings of the 34th International Conference on Neural Information Processing Systems","author":"Brown Tom\u00a0B.","year":"2020","unstructured":"Tom\u00a0B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel\u00a0M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS \u201920). Curran Associates Inc., Red Hook, NY, USA, Article 159, 25\u00a0pages."},{"key":"e_1_3_3_1_7_2","volume-title":"Conference on Neural Information Processing Systems (NeurIPS)","author":"Chen Xi","year":"2022","unstructured":"Xi Chen and Xiao Wang. 2022. PaLI: Scaling Language-Image Learning in 100+ Languages. In Conference on Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_3_3_1_8_2","volume-title":"torch.distributed.fsdp.FullyShardedDataParallel","author":"Contributors PyTorch","year":"2025","unstructured":"PyTorch Contributors. 2025. torch.distributed.fsdp.FullyShardedDataParallel. PyTorch. https:\/\/pytorch.org\/docs\/stable\/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel"},{"key":"e_1_3_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/N19-1423"},{"key":"e_1_3_3_1_10_2","volume-title":"International Conference on Learning Representations","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=YicbFdNTTy"},{"key":"e_1_3_3_1_11_2","series-title":"(NSDI\u201924)","volume-title":"Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation","author":"Duan Jiangfei","year":"2024","unstructured":"Jiangfei Duan, Ziang Song, Xupeng Miao, Xiaoli Xi, Dahua Lin, Harry Xu, Minjia Zhang, and Zhihao Jia. 2024. Parcae: proactive, liveput-optimized DNN training on preemptible instances. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (Santa Clara, CA, USA) (NSDI\u201924). USENIX Association, USA, Article 62, 19\u00a0pages."},{"key":"e_1_3_3_1_12_2","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et\u00a0al. 2024. The llama 3 herd of models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2407.21783 (2024)."},{"key":"e_1_3_3_1_13_2","volume-title":"OpenLLaMA: An Open Reproduction of LLaMA","author":"Geng Xinyang","year":"2023","unstructured":"Xinyang Geng and Hao Liu. 2023. OpenLLaMA: An Open Reproduction of LLaMA. https:\/\/github.com\/openlm-research\/open_llama"},{"key":"e_1_3_3_1_14_2","unstructured":"Runsheng Guo Victor Guo Antonio Kim Josh Hildred and Khuzaima Daudjee. 2022. Hydrozoa: Dynamic hybrid-parallel dnn training on serverless containers. Proceedings of Machine Learning and Systems 4 (2022) 779\u2013794."},{"key":"e_1_3_3_1_15_2","unstructured":"Ronghang Hu Vaibhav Singh Jack Cao Milad Mohammadi Yeounoh Chung Shauheen Zahirazami and Ross Girshick. 2022. Scaling PyTorch Models on Cloud TPUs with FSDP. https:\/\/pytorch.org\/blog\/scaling-pytorch-models-on-cloud-tpus-with-fsdp\/ Accessed: 2025-02-24."},{"key":"e_1_3_3_1_16_2","unstructured":"Yanping Huang Youlong Cheng Ankur Bapna Orhan Firat Dehao Chen Mia Chen HyoukJoong Lee Jiquan Ngiam Quoc\u00a0V Le Yonghui Wu et\u00a0al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_3_3_1_17_2","first-page":"673","volume-title":"2022 USENIX Annual Technical Conference (USENIX ATC 22)","author":"Jia Xianyan","year":"2022","unstructured":"Xianyan Jia, Le Jiang, Ang Wang, Wencong Xiao, Ziji Shi, Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, Xiaoyong Liu, and Wei Lin. 2022. Whale: Efficient Giant Model Training over Heterogeneous GPUs. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). USENIX Association, Carlsbad, CA, 673\u2013688. https:\/\/www.usenix.org\/conference\/atc22\/presentation\/jia-xianyan"},{"key":"e_1_3_3_1_18_2","unstructured":"Youhe Jiang Fangcheng Fu Xiaozhe Yao Guoliang He Xupeng Miao Ana Klimovic Bin Cui Binhang Yuan and Eiko Yoneki. 2025. Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs. arxiv:https:\/\/arXiv.org\/abs\/2502.00722\u00a0[cs.DC] https:\/\/arxiv.org\/abs\/2502.00722"},{"key":"e_1_3_3_1_19_2","series-title":"(NSDI\u201924)","volume-title":"Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation","author":"Jiang Ziheng","year":"2024","unstructured":"Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, and Xin Liu. 2024. MegaScale: scaling large language model training to more than 10,000 GPUs. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (Santa Clara, CA, USA) (NSDI\u201924). USENIX Association, USA, Article 41, 16\u00a0pages."},{"key":"e_1_3_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589350"},{"key":"e_1_3_3_1_21_2","doi-asserted-by":"crossref","unstructured":"Kyeonglok Kim Hyeonsu Lee Seungmin Oh and Euiseong Seo. 2022. Scale-Train: A Scalable DNN Training Framework for a Heterogeneous GPU Cloud. IEEE Access 10 (2022) 68468\u201368481.","DOI":"10.1109\/ACCESS.2022.3184692"},{"key":"e_1_3_3_1_22_2","volume-title":"3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings","author":"Kingma Diederik\u00a0P.","year":"2015","unstructured":"Diederik\u00a0P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http:\/\/arxiv.org\/abs\/1412.6980"},{"key":"e_1_3_3_1_23_2","unstructured":"Joel Lamy-Poirier. 2021. Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2106.02679 (2021)."},{"key":"e_1_3_3_1_24_2","doi-asserted-by":"publisher","unstructured":"Shen Li Yanli Zhao Rohan Varma Omkar Salpekar Pieter Noordhuis Teng Li Adam Paszke Jeff Smith Brian Vaughan Pritam Damania and Soumith Chintala. 2020. PyTorch distributed: experiences on accelerating data parallel training. Proc. VLDB Endow. 13 12 (Aug. 2020) 3005\u20133018. 10.14778\/3415478.3415530","DOI":"10.14778\/3415478.3415530"},{"key":"e_1_3_3_1_25_2","unstructured":"MarketsandMarkets. 2023. Nvidia\u2019s Dominance in the AI Chip Market. https:\/\/www.marketsandmarkets.com\/blog\/SE\/nvidia-dominance-in-the-ai-chip-market"},{"key":"e_1_3_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452773"},{"key":"e_1_3_3_1_27_2","doi-asserted-by":"crossref","unstructured":"Xupeng Miao Yining Shi Zhi Yang Bin Cui and Zhihao Jia. 2023. Sdpipe: A semi-decentralized framework for heterogeneity-aware pipeline-parallel training. Proceedings of the VLDB Endowment 16 9 (2023) 2354\u20132363.","DOI":"10.14778\/3598581.3598604"},{"key":"e_1_3_3_1_28_2","doi-asserted-by":"publisher","unstructured":"Xupeng Miao Yujie Wang Youhe Jiang Chunan Shi Xiaonan Nie Hailin Zhang and Bin Cui. 2022. Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism. Proc. VLDB Endow. 16 3 (nov 2022) 470\u2013479. 10.14778\/3570690.3570697","DOI":"10.14778\/3570690.3570697"},{"key":"e_1_3_3_1_29_2","volume-title":"International Conference on Learning Representations","author":"Micikevicius Paulius","year":"2018","unstructured":"Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed Precision Training. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=r1gs9JgRZ"},{"key":"e_1_3_3_1_30_2","doi-asserted-by":"crossref","unstructured":"Sergio Moreno-Alvarez Juan\u00a0M Haut Mercedes\u00a0E Paoletti Juan\u00a0A Rico-Gallego Juan\u00a0C Diaz-Martin and Javier Plaza. 2020. Training deep neural networks: a static load balancing approach. The Journal of Supercomputing 76 (2020) 9739\u20139754.","DOI":"10.1007\/s11227-020-03200-6"},{"key":"e_1_3_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_3_1_33_2","first-page":"7937","volume-title":"International Conference on Machine Learning","author":"Narayanan Deepak","year":"2021","unstructured":"Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2021. Memory-efficient pipeline-parallel dnn training. In International Conference on Machine Learning. PMLR, 7937\u20137947."},{"key":"e_1_3_3_1_34_2","first-page":"481","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Narayanan Deepak","year":"2020","unstructured":"Deepak Narayanan, Keshav Santhanam, Fiodar Kazhamiaka, Amar Phanishayee, and Matei Zaharia. 2020. Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 481\u2013498. https:\/\/www.usenix.org\/conference\/osdi20\/presentation\/narayanan-deepak"},{"key":"e_1_3_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_3_3_1_36_2","unstructured":"NVIDIA. 2024. NCCL: NVIDIA Collective Communications Library. https:\/\/developer.nvidia.com\/nccl."},{"key":"e_1_3_3_1_37_2","first-page":"307","volume-title":"2020 USENIX Annual Technical Conference (USENIX ATC 20)","author":"Park Jay\u00a0H","year":"2020","unstructured":"Jay\u00a0H Park, Gyeongchan Yun, M\u00a0Yi Chang, Nguyen\u00a0T Nguyen, Seungmin Lee, Jaesik Choi, Sam\u00a0H Noh, and Young-ri Choi. 2020. { HetPipe} : Enabling large { DNN} training on (whimpy) heterogeneous { GPU} clusters through integration of pipelined model parallelism and data parallelism. In 2020 USENIX Annual Technical Conference (USENIX ATC 20). 307\u2013321."},{"key":"e_1_3_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC59245.2023.00026"},{"key":"e_1_3_3_1_39_2","unstructured":"PyTorch. 2023. Training a 1 Trillion Parameter Model with PyTorch Fully Sharded Data Parallel on AWS. https:\/\/shorturl.at\/6Y4LT. Accessed: 2024-01-30."},{"key":"e_1_3_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_3_3_1_41_2","first-page":"551","volume-title":"2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Ren Jie","year":"2021","unstructured":"Jie Ren, Samyam Rajbhandari, Reza\u00a0Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, 551\u2013564. https:\/\/www.usenix.org\/conference\/atc21\/presentation\/ren-jie"},{"key":"e_1_3_3_1_42_2","doi-asserted-by":"crossref","unstructured":"Timo Schick and Hinrich Sch\u00fctze. 2020. It\u2019s not just size that matters: Small language models are also few-shot learners. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2009.07118 (2020).","DOI":"10.18653\/v1\/2021.naacl-main.185"},{"key":"e_1_3_3_1_43_2","unstructured":"Alexander Sergeev and Mike Del\u00a0Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1802.05799 (2018)."},{"key":"e_1_3_3_1_44_2","unstructured":"Noam Shazeer Youlong Cheng Niki Parmar Dustin Tran Ashish Vaswani Penporn Koanantakool Peter Hawkins HyoukJoong Lee Mingsheng Hong Cliff Young et\u00a0al. 2018. Mesh-tensorflow: Deep learning for supercomputers. Advances in neural information processing systems 31 (2018)."},{"key":"e_1_3_3_1_45_2","unstructured":"Mohammad Shoeybi Mostofa Patwary Raul Puri Patrick LeGresley Jared Casper and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR abs\/1909.08053 (2019). http:\/\/arxiv.org\/abs\/1909.08053"},{"key":"e_1_3_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3577193.3593704"},{"key":"e_1_3_3_1_47_2","unstructured":"Shaden Smith Mostofa Patwary Brandon Norick Patrick LeGresley Samyam Rajbhandari Jared Casper Zhun Liu Shrimai Prabhumoye George Zerveas Vijay Korthikanti et\u00a0al. 2022. Using deepspeed and megatron to train megatron-turing nlg 530b a large-scale generative language model. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2201.11990 (2022)."},{"key":"e_1_3_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3642970.3655843"},{"key":"e_1_3_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613175"},{"key":"e_1_3_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/3357384.3357895"},{"key":"e_1_3_3_1_51_2","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux Timoth\u00e9e Lacroix Baptiste Rozi\u00e8re Naman Goyal Eric Hambro Faisal Azhar et\u00a0al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2302.13971 (2023)."},{"key":"e_1_3_3_1_52_2","first-page":"563","volume-title":"2024 USENIX Annual Technical Conference (USENIX ATC 24)","author":"Um Taegeon","year":"2024","unstructured":"Taegeon Um, Byungsoo Oh, Minyoung Kang, Woo-Yeon Lee, Goeun Kim, Dongseob Kim, Youngtaek Kim, Mohd Muzzammil, and Myeongjae Jeon. 2024. Metis: Fast Automatic Distributed Training on Heterogeneous { GPUs}. In 2024 USENIX Annual Technical Conference (USENIX ATC 24). 563\u2013578."},{"key":"e_1_3_3_1_53_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan\u00a0N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_3_1_54_2","volume-title":"ICLR 2024","author":"Wang Guanhua","year":"2024","unstructured":"Guanhua Wang, Heyang Qin, Sam Ade\u00a0Jacobs, Xiaoxia Wu, Connor Holmes, Zhewei Yao, Samyam Rajbhandari, Olatunji Ruwase, Feng Yang, Lei Yang, and Yuxiong He. 2024. ZeRO++: Extremely Efficient Collective Communication for Large Model Training. In ICLR 2024. https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-extremely-efficient-collective-communication-for-large-model-training\/"},{"key":"e_1_3_3_1_55_2","volume-title":"Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping","author":"Wang Guanhua","year":"2024","unstructured":"Guanhua Wang, Chengming Zhang, Zheyu Shen, Ang Li, and Olatunji Ruwase. 2024. Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping. Technical Report MSR-TR-2024-40. Microsoft. https:\/\/www.microsoft.com\/en-us\/research\/publication\/domino-eliminating-communication-in-llm-training-via-generic-tensor-slicing-and-overlapping\/"},{"key":"e_1_3_3_1_56_2","unstructured":"Yifu Wang Horace He Less Wright Luca Wehrstedt Tianyu Liu and Wanchao Liang. 2024. Distributed w\/ TorchTitan: Introducing Async Tensor Parallelism in PyTorch. https:\/\/discuss.pytorch.org\/t\/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch\/209487 Accessed: 2025-02-24."},{"key":"e_1_3_3_1_57_2","unstructured":"Alex Woodie. 2023. How AWS Plans to Cope with GenaAI\u2019s Insatiable Desire for Compute. Datanami (11 Dec 2023). https:\/\/shorturl.at\/Gx69T Accessed: 2024-02-06."},{"key":"e_1_3_3_1_58_2","first-page":"595","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Xiao Wencong","year":"2018","unstructured":"Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. 2018. Gandiva: Introspective Cluster Scheduling for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 595\u2013610. https:\/\/www.usenix.org\/conference\/osdi18\/presentation\/xiao"},{"key":"e_1_3_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41406.2024.00100"},{"key":"e_1_3_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/HiPC62374.2024.00015"},{"key":"e_1_3_3_1_61_2","unstructured":"Ran Yan Youhe Jiang Xiaonan Nie Fangcheng Fu Bin Cui and Binhang Yuan. 2025. HexiScale: Accommodating Large Language Model Training over Heterogeneous Environment. arxiv:https:\/\/arXiv.org\/abs\/2409.01143\u00a0[cs.DC] https:\/\/arxiv.org\/abs\/2409.01143"},{"key":"e_1_3_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01179"},{"key":"e_1_3_3_1_63_2","series-title":"(ICML\u201920)","volume-title":"Proceedings of the 37th International Conference on Machine Learning","author":"Zhang Jingqing","year":"2020","unstructured":"Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter\u00a0J. Liu. 2020. PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning(ICML\u201920). JMLR.org, Article 1051, 12\u00a0pages."},{"key":"e_1_3_3_1_64_2","unstructured":"Peiyuan Zhang Guangtao Zeng Tianduo Wang and Wei Lu. 2024. TinyLlama: An Open-Source Small Language Model. arxiv:https:\/\/arXiv.org\/abs\/2401.02385\u00a0[cs.CL]"},{"key":"e_1_3_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.1145\/3627703.3629580"},{"key":"e_1_3_3_1_66_2","doi-asserted-by":"publisher","unstructured":"Yanli Zhao Andrew Gu Rohan Varma Liang Luo Chien-Chin Huang Min Xu Less Wright Hamid Shojanazeri Myle Ott Sam Shleifer Alban Desmaison Can Balioglu Pritam Damania Bernard Nguyen Geeta Chauhan Yuchen Hao Ajit Mathews and Shen Li. 2023. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. Proc. VLDB Endow. 16 12 (Aug. 2023) 3848\u20133860. 10.14778\/3611540.3611569","DOI":"10.14778\/3611540.3611569"},{"key":"e_1_3_3_1_67_2","first-page":"559","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric\u00a0P Xing, et\u00a0al. 2022. Alpa: Automating inter-and { Intra-Operator} parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559\u2013578."}],"event":{"name":"ICS '25: 2025 International Conference on Supercomputing","location":"Salt Lake City USA","acronym":"ICS '25","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture"]},"container-title":["Proceedings of the 39th ACM International Conference on Supercomputing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3721145.3730418","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T13:03:11Z","timestamp":1755867791000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3721145.3730418"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,8]]},"references-count":66,"alternative-id":["10.1145\/3721145.3730418","10.1145\/3721145"],"URL":"https:\/\/doi.org\/10.1145\/3721145.3730418","relation":{},"subject":[],"published":{"date-parts":[[2025,6,8]]},"assertion":[{"value":"2025-08-22","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}