{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T15:49:39Z","timestamp":1775576979507,"version":"3.50.1"},"reference-count":95,"publisher":"Association for Computing Machinery (ACM)","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,6,17]]},"abstract":"<jats:p>As the scale of models and training data continues to grow, there is an expanding reliance on more GPUs to train large-scale models, which inevitably increases the likelihood of encountering dynamic stragglers that some devices lag behind in performance occasionally. However, hybrid parallel training, one of the de facto paradigms to train large models, is typically sensitive to the stragglers.<\/jats:p>\n                  <jats:p>\n                    This paper presents\n                    <jats:sc>Malleus<\/jats:sc>\n                    , a straggler-resilient hybrid parallel training framework for large-scale models.\n                    <jats:sc>Malleus<\/jats:sc>\n                    quantifies the stragglers at the nuanced, per-GPU granularity during training, and develops a novel planning algorithm to deduce the optimal parallelization of GPU devices, pipeline stages, model layers, and training data, maximizing training efficiency when stragglers exist. In addition, once a shift in the straggler situation is detected,\n                    <jats:sc>Malleus<\/jats:sc>\n                    adaptively adjusts the parallelization via a re-planning process, and seamlessly and efficiently migrates the model states on the fly, without sacrificing the stability of the training tasks. Empirical results on large language models with up to 110B parameters show that\n                    <jats:sc>Malleus<\/jats:sc>\n                    consistently outperforms existing parallel training frameworks under various straggler situations, delivering on average 2.63-5.28x of efficiency improvement.\n                  <\/jats:p>","DOI":"10.1145\/3725322","type":"journal-article","created":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T21:23:29Z","timestamp":1750281809000},"page":"1-28","source":"Crossref","is-referenced-by-count":4,"title":["Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-5342-0194","authenticated-orcid":false,"given":"Haoyang","family":"Li","sequence":"first","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1658-0380","authenticated-orcid":false,"given":"Fangcheng","family":"Fu","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-3367-7486","authenticated-orcid":false,"given":"Hao","family":"Ge","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-1495-0499","authenticated-orcid":false,"given":"Sheng","family":"Lin","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-6458-7033","authenticated-orcid":false,"given":"Xuanyu","family":"Wang","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0922-1942","authenticated-orcid":false,"given":"Jiawen","family":"Niu","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-8375-493X","authenticated-orcid":false,"given":"Yujie","family":"Wang","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-4188-7742","authenticated-orcid":false,"given":"Hailin","family":"Zhang","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6766-757X","authenticated-orcid":false,"given":"Xiaonan","family":"Nie","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1681-4677","authenticated-orcid":false,"given":"Bin","family":"Cui","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,6,18]]},"reference":[{"key":"e_1_2_2_1_1","unstructured":"2009. Optimization with PuLP. https:\/\/coin-or.github.io\/pulp\/."},{"key":"e_1_2_2_2_1","unstructured":"2019. Elastic Horovod. https:\/\/horovod.readthedocs.io\/en\/latest\/elastic_include.html."},{"key":"e_1_2_2_3_1","unstructured":"2023. Torch Distributed Elastic. https:\/\/pytorch.org\/docs\/stable\/distributed.elastic.html."},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3492321.3519584"},{"key":"e_1_2_2_5_1","volume-title":"Annual Conference on Neural Information Processing Systems 2020 (NeurIPS","author":"Brown Tom B.","year":"2020","unstructured":"Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Annual Conference on Neural Information Processing Systems 2020 (NeurIPS 2020)."},{"key":"e_1_2_2_6_1","doi-asserted-by":"crossref","unstructured":"Michael L Bynum Gabriel A Hackebeil William E Hart Carl D Laird Bethany L Nicholson John D Siirola Jean-Paul Watson David L Woodruff et al. 2021. Pyomo-optimization modeling in python. Vol. 67. Springer.","DOI":"10.1007\/978-3-030-68928-5_5"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2022.3148237"},{"key":"e_1_2_2_8_1","first-page":"2631","article-title":"Biathlon","volume":"17","author":"Chang Chaokun","year":"2024","unstructured":"Chaokun Chang, Eric Lo, and Chunxiao Ye. 2024. Biathlon: Harnessing Model Resilience for Accelerating ML Inference Pipelines. Proc. VLDB Endow., Vol. 17, 10 (2024), 2631-2640.","journal-title":"Harnessing Model Resilience for Accelerating ML Inference Pipelines. Proc. VLDB Endow."},{"key":"e_1_2_2_9_1","first-page":"571","volume-title":"11th USENIX Symposium on Operating Systems Design and Implementation (OSDI","author":"Chilimbi Trishul M.","year":"2014","unstructured":"Trishul M. Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an Efficient and Scalable Deep Learning Training System. 
In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2014). USENIX Association, 571-582."},{"key":"e_1_2_2_10_1","volume-title":"International Conference on Learning Representations 2024 (ICLR","author":"Dao Tri","year":"2024","unstructured":"Tri Dao. 2024. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In International Conference on Learning Representations 2024 (ICLR 2024)."},{"key":"e_1_2_2_11_1","volume-title":"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Annual Conference on Neural Information Processing Systems 2022 (NeurIPS","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Annual Conference on Neural Information Processing Systems 2022 (NeurIPS 2022)."},{"key":"e_1_2_2_12_1","first-page":"1232","volume-title":"Large Scale Distributed Deep Networks. In 26th Annual Conference on Neural Information Processing Systems 2012 (NeurIPS","author":"Dean Jeffrey","year":"2012","unstructured":"Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large Scale Distributed Deep Networks. In 26th Annual Conference on Neural Information Processing Systems 2012 (NeurIPS 2012). 1232-1240."},{"key":"e_1_2_2_13_1","unstructured":"Abhimanyu Dubey Abhinav Jauhri et al. 2024. The Llama 3 Herd of Models. CoRR Vol. abs\/2407.21783 (2024)."},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.14778\/3648160.3648165"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.14778\/3665844.3665857"},{"key":"e_1_2_2_16_1","first-page":"3304","volume-title":"International Conference on Machine Learning 2020 (ICML","author":"Fu Fangcheng","year":"2020","unstructured":"Fangcheng Fu, Yuzheng Hu, Yihan He, Jiawei Jiang, Yingxia Shao, Ce Zhang, and Bin Cui. 2020. Don't Waste Your Bits! Squeeze Activations and Gradients for Deep Neural Networks via TinyScript. In International Conference on Machine Learning 2020 (ICML 2020). 3304-3314."},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/3503585.3503590"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3694715.3695960"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3694715.3695969"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-024-3872-3"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2987550.2987554"},{"key":"e_1_2_2_22_1","first-page":"1223","volume-title":"Annual Conference on Neural Information Processing Systems 2013 (NeurIPS","author":"Ho Qirong","year":"2013","unstructured":"Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B. Gibbons, Garth A. Gibson, Gregory R. Ganger, and Eric P. Xing. 2013. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. In Annual Conference on Neural Information Processing Systems 2013 (NeurIPS 2013). 1223-1231."},{"key":"e_1_2_2_23_1","first-page":"103","volume-title":"Annual Conference on Neural Information Processing Systems 2019 (NeurIPS","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. 
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Annual Conference on Neural Information Processing Systems 2019 (NeurIPS 2019). 103-112."},{"key":"e_1_2_2_24_1","first-page":"721","volume-title":"Elastic Resource Sharing for Distributed Deep Learning. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI","author":"Hwang Changho","year":"2021","unstructured":"Changho Hwang, Taehyun Kim, Sunghyun Kim, Jinwoo Shin, and KyoungSoo Park. 2021. Elastic Resource Sharing for Distributed Deep Learning. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2021). 721-739."},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517848"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613152"},{"key":"e_1_2_2_27_1","first-page":"673","volume-title":"2022 USENIX Annual Technical Conference (ATC","author":"Jia Xianyan","year":"2022","unstructured":"Xianyan Jia, Le Jiang, Ang Wang, Wencong Xiao, Ziji Shi, Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, Xiaoyong Liu, and Wei Lin. 2022. Whale: Efficient Giant Model Training over Heterogeneous GPUs. In 2022 USENIX Annual Technical Conference (ATC 2022). 673-688."},{"key":"e_1_2_2_28_1","volume-title":"Proceedings of Machine Learning and Systems 2019 (MLSys","author":"Jia Zhihao","year":"2019","unstructured":"Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond Data and Model Parallelism for Deep Neural Networks. In Proceedings of Machine Learning and Systems 2019 (MLSys 2019)."},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3035933"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196894"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2023\/238"},{"key":"e_1_2_2_32_1","first-page":"745","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI","author":"Jiang Ziheng","year":"2024","unstructured":"Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, and Xin Liu. 2024. MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 2024). 745-760."},{"key":"e_1_2_2_33_1","volume-title":"Scaling Laws for Neural Language Models. CoRR","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. CoRR, Vol. abs\/2001.08361 (2020)."},{"key":"e_1_2_2_34_1","volume-title":"Reducing Activation Recomputation in Large Transformer Models. CoRR","author":"Korthikanti Vijay","year":"2022","unstructured":"Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Reducing Activation Recomputation in Large Transformer Models. CoRR, Vol. 
abs\/2205.05198 (2022)."},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3639304"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.14778\/3659437.3659449"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.14778\/3681954.3682003"},{"key":"e_1_2_2_38_1","volume-title":"AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness. In Annual Conference on Neural Information Processing Systems 2022 (NeurIPS","author":"Li Dacheng","year":"2022","unstructured":"Dacheng Li, Hongyi Wang, Eric P. Xing, and Hao Zhang. 2022. AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness. In Annual Conference on Neural Information Processing Systems 2022 (NeurIPS 2022)."},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457542"},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3552326.3587445"},{"key":"e_1_2_2_41_1","first-page":"583","volume-title":"Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI","author":"Li Mu","year":"2014","unstructured":"Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2014). 583-598."},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415530"},{"key":"e_1_2_2_43_1","volume-title":"Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources. CoRR","author":"Lin Haibin","year":"2019","unstructured":"Haibin Lin, Hang Zhang, Yifei Ma, Tong He, Zhi Zhang, Sheng Zha, and Mu Li. 2019. Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources. CoRR, Vol. abs\/1904.12043 (2019)."},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/3632093.3632095"},{"key":"e_1_2_2_45_1","first-page":"547","volume-title":"Demystifying Data Management for Large Language Models. In Companion of the 2024 International Conference on Management of Data (SIGMOD","author":"Miao Xupeng","year":"2024","unstructured":"Xupeng Miao, Zhihao Jia, and Bin Cui. 2024. Demystifying Data Management for Large Language Models. In Companion of the 2024 International Conference on Management of Data (SIGMOD 2024). ACM, 547-555."},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452773"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-022-3581-9"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/3570690.3570697"},{"key":"e_1_2_2_49_1","volume-title":"Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters. In Architectural Support for Programming Languages and Operating Systems 2024 (ASPLOS","author":"Mo Zizhao","year":"2024","unstructured":"Zizhao Mo, Huanle Xu, and Chengzhong Xu. 2024. Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters. In Architectural Support for Programming Languages and Operating Systems 2024 (ASPLOS 2024). 
499-513."},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476311.3476374"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.14778\/3636218.3636227"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_2_2_53_1","first-page":"7937","volume-title":"Memory-Efficient Pipeline-Parallel DNN Training. In International Conference on Machine Learning 2021 (ICML","volume":"139","author":"Narayanan Deepak","year":"2021","unstructured":"Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2021a. Memory-Efficient Pipeline-Parallel DNN Training. In International Conference on Machine Learning 2021 (ICML 2021), Vol. 139. 7937-7947."},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611540.3611564"},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588964"},{"key":"e_1_2_2_57_1","unstructured":"NVIDIA. 2024a. cuBLAS. https:\/\/docs.nvidia.com\/cuda\/cublas\/."},{"key":"e_1_2_2_58_1","unstructured":"NVIDIA. 2024b. cutlass. https:\/\/github.com\/NVIDIA\/cutlass\/."},{"key":"e_1_2_2_59_1","unstructured":"NVIDIA. 2024c. NVIDIA Collective Communications Library (NCCL). https:\/\/developer.nvidia.com\/nccl."},{"key":"e_1_2_2_60_1","unstructured":"NVIDIA. 2024d. NVIDIA Resiliency Extension. https:\/\/github.com\/NVIDIA\/nvidia-resiliency-ext."},{"key":"e_1_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-024-4125-9"},{"key":"e_1_2_2_62_1","volume-title":"Proceedings of Machine Learning and Systems 2020 (MLSys","author":"Or Andrew","year":"2020","unstructured":"Andrew Or, Haoyu Zhang, and Michael J. Freedman. 2020. Resource Elasticity in Distributed Deep Learning. In Proceedings of Machine Learning and Systems 2020 (MLSys 2020)."},{"key":"e_1_2_2_63_1","first-page":"8024","volume-title":"High-Performance Deep Learning Library. In Annual Conference on Neural Information Processing Systems 2019 (NeurIPS","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas K\u00f6pf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Annual Conference on Neural Information Processing Systems 2019 (NeurIPS 2019). 8024-8035."},{"key":"e_1_2_2_64_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611540.3611573"},{"key":"e_1_2_2_65_1","volume-title":"Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI","author":"Qiao Aurick","year":"2021","unstructured":"Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, and Eric P. Xing. 2021. Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2021)."},{"key":"e_1_2_2_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_2_2_67_1","volume-title":"Horovod: fast and easy distributed deep learning in TensorFlow. CoRR","author":"Sergeev Alexander","year":"2018","unstructured":"Alexander Sergeev and Mike Del Balso. 2018. 
Horovod: fast and easy distributed deep learning in TensorFlow. CoRR, Vol. abs\/1802.05799 (2018)."},{"key":"e_1_2_2_68_1","volume-title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR, Vol. abs\/1909.08053 (2019)."},{"key":"e_1_2_2_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/3617338"},{"key":"e_1_2_2_70_1","first-page":"342","volume-title":"AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators. In IEEE International Symposium on High Performance Computer Architecture, 2020","author":"Song Linghao","year":"2020","unstructured":"Linghao Song, Fan Chen, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. 2020. AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators. In IEEE International Symposium on High Performance Computer Architecture, 2020 (HPCA 2020). 342-355."},{"key":"e_1_2_2_71_1","doi-asserted-by":"publisher","DOI":"10.1145\/3642970.3655843"},{"key":"e_1_2_2_72_1","first-page":"24829","volume-title":"Piper: Multidimensional Planner for DNN Parallelization. In Annual Conference on Neural Information Processing Systems 2021 (NeurIPS","author":"Tarnawski Jakub","year":"2021","unstructured":"Jakub Tarnawski, Deepak Narayanan, and Amar Phanishayee. 2021. Piper: Multidimensional Planner for DNN Parallelization. In Annual Conference on Neural Information Processing Systems 2021 (NeurIPS 2021). 24829-24840."},{"key":"e_1_2_2_73_1","unstructured":"The Imbue Team. 2024. From bare metal to a 70B model: infrastructure set-up and scripts. https:\/\/imbue.com\/research\/70b-infrastructure\/."},{"key":"e_1_2_2_74_1","first-page":"497","volume-title":"Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI","author":"Thorpe John","year":"2023","unstructured":"John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, Ravi Netravali, and Guoqing Harry Xu. 2023. Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2023). 497-513."},{"key":"e_1_2_2_75_1","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton-Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang Angela Fan Melanie Kambadur Sharan Narang Aur\u00e9lien Rodriguez Robert Stojnic Sergey Edunov and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. CoRR Vol. 
abs\/2307.09288 (2023)."},{"key":"e_1_2_2_76_1","first-page":"5998","volume-title":"Annual Conference on Neural Information Processing Systems 2017 (NeurIPS","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Annual Conference on Neural Information Processing Systems 2017 (NeurIPS 2017). 5998-6008."},{"key":"e_1_2_2_77_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2024.3370614"},{"key":"e_1_2_2_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/3676641.3715998"},{"key":"e_1_2_2_79_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613145"},{"key":"e_1_2_2_80_1","volume-title":"FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training. CoRR","author":"Wu Tianyuan","year":"2024","unstructured":"Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, and Liping Zhang. 2024. FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training. CoRR, Vol. abs\/2410.12588 (2024)."},{"key":"e_1_2_2_81_1","doi-asserted-by":"publisher","DOI":"10.14778\/3626292.3626303"},{"key":"e_1_2_2_82_1","volume-title":"HetHub: A Heterogeneous distributed hybrid training system for large-scale models. CoRR","author":"Xu Si","year":"2024","unstructured":"Si Xu, Zixiao Huang, Yan Zeng, Shengen Yan, Xuefei Ning, Haolin Ye, Sipei Gu, Chunsheng Shui, Zhezheng Lin, Hao Zhang, Sheng Wang, Guohao Dai, and Yu Wang. 2024. HetHub: A Heterogeneous distributed hybrid training system for large-scale models. CoRR, Vol. abs\/2405.16256 (2024)."},{"key":"e_1_2_2_83_1","first-page":"437","volume-title":"SkyPilot: An Intercloud Broker for Sky Computing. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI","author":"Yang Zongheng","year":"2023","unstructured":"Zongheng Yang, Zhanghao Wu, Michael Luo, Wei-Lin Chiang, Romil Bhardwaj, Woosuk Kwon, Siyuan Zhuang, Frank Sifei Luan, Gautam Mittal, Scott Shenker, and Ion Stoica. 2023. SkyPilot: An Intercloud Broker for Sky Computing. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2023). 437-455."},{"key":"e_1_2_2_84_1","first-page":"1283","article-title":"DimmWitted","volume":"7","author":"Zhang Ce","year":"2014","unstructured":"Ce Zhang and Christopher R\u00e9. 2014. DimmWitted: A Study of Main-Memory Statistical Analytics. Proc. VLDB Endow., Vol. 7, 12 (2014), 1283-1294.","journal-title":"A Study of Main-Memory Statistical Analytics. Proc. VLDB Endow."},{"key":"e_1_2_2_85_1","doi-asserted-by":"publisher","DOI":"10.1145\/3627703.3629580"},{"key":"e_1_2_2_86_1","doi-asserted-by":"publisher","DOI":"10.14778\/3467861.3467867"},{"key":"e_1_2_2_87_1","doi-asserted-by":"publisher","DOI":"10.14778\/3561261.3561265"},{"key":"e_1_2_2_88_1","volume-title":"Comput. Sci. Technol.","author":"Zhang Zhen-Xing","year":"2024","unstructured":"Zhen-Xing Zhang, Yuan-Bo Wen, Han-Qi Lv, Chang Liu, Rui Zhang, Xia-Qing Li, Chao Wang, Zi-Dong Du, Qi Guo, Ling Li, Xue-Hai Zhou, and Yun-Ji Chen. 2024b. AI Computing Systems for LLMs Training: A Review. J. Comput. Sci. Technol., (2024)."},{"key":"e_1_2_2_89_1","doi-asserted-by":"publisher","DOI":"10.1145\/3589773"},{"key":"e_1_2_2_90_1","first-page":"1507","volume-title":"Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning. 
In IEEE International Conference on Distributed Computing Systems (ICDCS","author":"Zhao Xing","year":"2019","unstructured":"Xing Zhao, Aijun An, Junfeng Liu, and Bao Xin Chen. 2019. Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning. In IEEE International Conference on Distributed Computing Systems (ICDCS 2019). 1507-1517."},{"key":"e_1_2_2_91_1","first-page":"3848","article-title":"PyTorch FSDP","volume":"16","author":"Zhao Yanli","year":"2023","unstructured":"Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. 2023a. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. Proc. VLDB Endow., Vol. 16, 12 (2023), 3848-3860.","journal-title":"Experiences on Scaling Fully Sharded Data Parallel. Proc. VLDB Endow."},{"key":"e_1_2_2_92_1","first-page":"559","volume-title":"Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. 2022. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2022). 559-578."},{"key":"e_1_2_2_93_1","doi-asserted-by":"publisher","DOI":"10.1145\/3617327"},{"key":"e_1_2_2_94_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-023-00235-6"},{"key":"e_1_2_2_95_1","volume-title":"Data Sci. Eng.","author":"Zhu Jingyu","year":"2024","unstructured":"Jingyu Zhu, Xintong Zhao, Yu Sun, Shaoxu Song, and Xiaojie Yuan. 2024. Relational Data Cleaning Meets Artificial Intelligence: A Survey. Data Sci. Eng., (2024)."}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3725322","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T18:53:11Z","timestamp":1774983191000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3725322"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,17]]},"references-count":95,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,6,17]]}},"alternative-id":["10.1145\/3725322"],"URL":"https:\/\/doi.org\/10.1145\/3725322","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,17]]}}}
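The record above is a standard Crossref REST API work response (note "source":"Crossref" and the "status"/"message" envelope). A minimal sketch of how such a record can be fetched and a few of its fields read, assuming the Python requests package; the endpoint https://api.crossref.org/works/{DOI} is the public Crossref REST API, and the contact address in the User-Agent header is a hypothetical placeholder following Crossref's "polite pool" convention.

    # Fetch the Crossref work record for the Malleus paper and print a summary.
    import requests

    DOI = "10.1145/3725322"  # Malleus, Proc. ACM Manag. Data 3(3), from the record above

    resp = requests.get(
        f"https://api.crossref.org/works/{DOI}",
        # Hypothetical contact address; Crossref asks polite clients to identify themselves.
        headers={"User-Agent": "example-client/0.1 (mailto:you@example.org)"},
        timeout=30,
    )
    resp.raise_for_status()

    # The payload wraps the work record in a "message" field, as seen above.
    work = resp.json()["message"]

    print(work["title"][0])                       # "title" is a list of strings
    authors = work.get("author", [])
    print(", ".join(f'{a["given"]} {a["family"]}' for a in authors))
    print("DOI:", work["DOI"])
    print("Cited by:", work.get("is-referenced-by-count"))
    print("References deposited:", work.get("reference-count"))

Run as-is, this prints the paper title, the ten Peking University authors, and the citation and reference counts recorded in the JSON above; any field not deposited for a given work simply comes back absent, hence the .get() accesses.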