{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,19]],"date-time":"2025-12-19T15:54:22Z","timestamp":1766159662680,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":89,"publisher":"ACM","license":[{"start":{"date-parts":[[2025,3,30]],"date-time":"2025-03-30T00:00:00Z","timestamp":1743292800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,3,30]]},"DOI":"10.1145\/3689031.3717461","type":"proceedings-article","created":{"date-parts":[[2025,3,26]],"date-time":"2025-03-26T06:25:20Z","timestamp":1742970320000},"page":"1298-1316","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7433-7384","authenticated-orcid":false,"given":"Zhanda","family":"Zhu","sequence":"first","affiliation":[{"name":"University of Toronto, Vector, Institute, CentML"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0162-4547","authenticated-orcid":false,"given":"Christina","family":"Giannoula","sequence":"additional","affiliation":[{"name":"University of Toronto, Vector, Institute, CentML"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-6033-369X","authenticated-orcid":false,"given":"Muralidhar","family":"Andoorveedu","sequence":"additional","affiliation":[{"name":"CentML"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-2612-6634","authenticated-orcid":false,"given":"Qidong","family":"Su","sequence":"additional","affiliation":[{"name":"University of Toronto, Vector, Institute, CentML"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2169-1395","authenticated-orcid":false,"given":"Karttikeya","family":"Mangalam","sequence":"additional","affiliation":[{"name":"SigIQ.ai"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1999-2359","authenticated-orcid":false,"given":"Bojian","family":"Zheng","sequence":"additional","affiliation":[{"name":"University of Toronto, Vector, Institute, CentML"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3839-0919","authenticated-orcid":false,"given":"Gennady","family":"Pekhimenko","sequence":"additional","affiliation":[{"name":"University of Toronto, Vector, Institute, CentML"}]}],"member":"320","published-online":{"date-parts":[[2025,3,30]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"12th USENIX symposium on operating systems design and implementation (OSDI 16)","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. {TensorFlow}: a system for {Large-Scale} machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16). 265--283."},{"key":"e_1_3_2_1_2_1","unstructured":"Ebtesam Almazrouei Hamza Alobeidli Abdulaziz Alshamsi Alessandro Cappelli Ruxandra Cojocaru M\u00e9rouane Debbah \u00c9tienne Goffinet Daniel Hesslow Julien Launay Quentin Malartic et al. 2023. The falcon series of open language models. 
arXiv preprint arXiv:2311.16867 (2023)."},{"key":"e_1_3_2_1_3_1","first-page":"12267","article-title":"Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction","volume":"35","author":"Andoorveedu Muralidhar","year":"2022","unstructured":"Muralidhar Andoorveedu, Zhanda Zhu, Bojian Zheng, and Gennady Pekhimenko. 2022. Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction. Advances in Neural Information Processing Systems 35 (2022), 12267--12282.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"crossref","unstructured":"Jason Ansel Edward Yang Horace He Natalia Gimelshein Animesh Jain Michael Voznesensky Bin Bao David Berard Geeta Chauhan Anjali Chourdia et al. 2024. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. (2024). To appear at ASPLOS.","DOI":"10.1145\/3620665.3640366"},{"key":"e_1_3_2_1_5_1","unstructured":"Anthropic. 2024. Claude. https:\/\/claude.ai\/."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3492321.3519584"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3477132.3483553"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.bigscience-1.9"},{"key":"e_1_3_2_1_9_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901."},{"key":"e_1_3_2_1_10_1","first-page":"209","article-title":"Klee: unassisted and automatic generation of high-coverage tests for complex systems programs","volume":"8","author":"Cadar Cristian","year":"2008","unstructured":"Cristian Cadar, Daniel Dunbar, Dawson R Engler, et al. 2008. Klee: unassisted and automatic generation of high-coverage tests for complex systems programs.. In OSDI, Vol. 8. 209--224.","journal-title":"OSDI"},{"key":"e_1_3_2_1_11_1","volume-title":"The Tenth International Conference on Learning Representations, ICLR.","author":"Chen Beidi","year":"2022","unstructured":"Beidi Chen, Tri Dao, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, and Christopher R\u00e9. 2022. Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models. In The Tenth International Conference on Learning Representations, ICLR."},{"key":"e_1_3_2_1_12_1","volume-title":"Shuai Zheng, Zhen Zhang, Zhiru Zhang, and Yida Wang.","author":"Chen Hongzheng","year":"2023","unstructured":"Hongzheng Chen, Cody Hao Yu, Shuai Zheng, Zhen Zhang, Zhiru Zhang, and Yida Wang. 2023. Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training. arXiv:2302.08005 [cs.LG]"},{"key":"e_1_3_2_1_13_1","volume-title":"International Conference on Machine Learning (ICML).","author":"Chen Jianfei","year":"2021","unstructured":"Jianfei Chen, Lianmin Zheng, Zhewei Yao, Dequan Wang, Ion Stoica, Michael W Mahoney, and Joseph E Gonzalez. 2021. ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training. 
In International Conference on Machine Learning (ICML)."},{"key":"e_1_3_2_1_14_1","volume-title":"Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al.","author":"Chen Mark","year":"2021","unstructured":"Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)."},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939785"},{"key":"e_1_3_2_1_16_1","volume-title":"Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation.","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: an automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation."},{"key":"e_1_3_2_1_17_1","volume-title":"Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174","author":"Chen Tianqi","year":"2016","unstructured":"Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016)."},{"key":"e_1_3_2_1_18_1","unstructured":"Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. (2023)."},{"key":"e_1_3_2_1_19_1","unstructured":"Tri Dao Daniel Y. Fu Stefano Ermon Atri Rudra and Christopher R\u00e9. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575702"},{"key":"e_1_3_2_1_21_1","volume-title":"Proteus: Simulating the Performance of Distributed DNN Training. arXiv preprint arXiv:2306.02267","author":"Duan Jiangfei","year":"2023","unstructured":"Jiangfei Duan, Xiuhong Li, Ping Xu, Xingcheng Zhang, Shengen Yan, Yun Liang, and Dahua Lin. 2023. Proteus: Simulating the Performance of Distributed DNN Training. arXiv preprint arXiv:2306.02267 (2023)."},{"key":"e_1_3_2_1_22_1","unstructured":"facebookresearch\/llama. 2023. llama\/MODEL_CARD.md. https:\/\/github.com\/facebookresearch\/llama\/blob\/main\/MODEL_CARD.md."},{"key":"e_1_3_2_1_23_1","unstructured":"facebookresearch\/llama. 2024. llama3. https:\/\/github.com\/meta-llama\/llama3."},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/263580.263648"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3437801.3441593"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575703"},{"volume-title":"Symbolic analysis in analog integrated circuit design","author":"Floberg Henrik","key":"e_1_3_2_1_27_1","unstructured":"Henrik Floberg. 2012. Symbolic analysis in analog integrated circuit design. Vol. 413. Springer Science & Business Media."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"crossref","unstructured":"John Forrest and Robin Lougee-Heimer. 2005. CBC user guide. In Emerging theory methods and applications. INFORMS 257--277.","DOI":"10.1287\/educ.1053.0020"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.265355"},{"key":"e_1_3_2_1_30_1","unstructured":"Google. 2022. XLA. 
https:\/\/www.tensorflow.org\/xla."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"crossref","unstructured":"Cong Guo Rui Zhang Jiale Xu Jingwen Leng Zihan Liu Ziyu Huang Minyi Guo Hao Wu Shouren Zhao Junping Zhao et al. 2024. GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching. arXiv preprint arXiv:2401.08156 (2024).","DOI":"10.1145\/3620665.3640423"},{"key":"e_1_3_2_1_32_1","volume-title":"Pipetransformer: Automated elastic pipelining for distributed training of transformers. arXiv preprint arXiv:2102.03161","author":"He Chaoyang","year":"2021","unstructured":"Chaoyang He, Shen Li, Mahdi Soltanolkotabi, and Salman Avestimehr. 2021. Pipetransformer: Automated elastic pipelining for distributed training of transformers. arXiv preprint arXiv:2102.03161 (2021)."},{"key":"e_1_3_2_1_33_1","volume-title":"Proceedings of Machine Learning and Systems","author":"He Horace","year":"2023","unstructured":"Horace He and Shangdi Yu. 2023. Transcending Runtime-Memory Tradeoffs in Checkpointing by being Fusion Aware. Proceedings of Machine Learning and Systems (2023)."},{"key":"e_1_3_2_1_34_1","volume-title":"Proceedings of Machine Learning and Systems","author":"Hu Hanpeng","year":"2022","unstructured":"Hanpeng Hu, Chenyu Jiang, Yuchen Zhong, Yanghua Peng, Chuan Wu, Yibo Zhu, Haibin Lin, and Chuanxiong Guo. 2022. dpro: A generic performance diagnosis and optimization toolkit for expediting distributed dnn training. Proceedings of Machine Learning and Systems (2022)."},{"key":"e_1_3_2_1_35_1","volume-title":"Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_3_2_1_36_1","first-page":"497","article-title":"Checkmate: Breaking the memory wall with optimal tensor rematerialization","volume":"2","author":"Jain Paras","year":"2020","unstructured":"Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Checkmate: Breaking the memory wall with optimal tensor rematerialization. Proceedings of Machine Learning and Systems 2 (2020), 497--511.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_37_1","volume-title":"torchgpipe: On-the-fly pipeline parallelism for training giant models. arXiv preprint arXiv:2004.09910","author":"Kim Chiheon","year":"2020","unstructured":"Chiheon Kim, Heungsub Lee, Myungryong Jeong, Woonhyuk Baek, Boogeon Yoon, Ildoo Kim, Sungbin Lim, and Sungwoong Kim. 2020. torchgpipe: On-the-fly pipeline parallelism for training giant models. arXiv preprint arXiv:2004.09910 (2020)."},{"key":"e_1_3_2_1_38_1","volume-title":"Dynamic Tensor Rematerialization. In International Conference on Learning Representations.","author":"Kirisame Marisa","year":"2020","unstructured":"Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, and Zachary Tatlock. 2020. Dynamic Tensor Rematerialization. 
In International Conference on Learning Representations."},{"key":"e_1_3_2_1_39_1","volume-title":"Proceedings of Machine Learning and Systems 5","author":"Korthikanti Vijay Anand","year":"2023","unstructured":"Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2023. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems 5 (2023)."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2023.3247001"},{"key":"e_1_3_2_1_41_1","first-page":"6630","article-title":"Amp: Automatically finding model parallel strategies with heterogeneity awareness","volume":"35","author":"Li Dacheng","year":"2022","unstructured":"Dacheng Li, Hongyi Wang, Eric Xing, and Hao Zhang. 2022. Amp: Automatically finding model parallel strategies with heterogeneity awareness. Advances in Neural Information Processing Systems 35 (2022), 6630--6639.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_42_1","unstructured":"Shen Li Yanli Zhao Rohan Varma Omkar Salpekar Pieter Noordhuis Teng Li Adam Paszke Jeff Smith Brian Vaughan Pritam Damania et al. 2020. Pytorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704 (2020)."},{"key":"e_1_3_2_1_43_1","volume-title":"International Conference on Machine Learning. PMLR, 6543--6552","author":"Li Zhuohan","year":"2021","unstructured":"Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica. 2021. Terapipe: Token-level pipeline parallelism for training large-scale language models. In International Conference on Machine Learning. PMLR, 6543--6552."},{"key":"e_1_3_2_1_44_1","volume-title":"Proceedings of Machine Learning and Systems","author":"Lin Bin","year":"2023","unstructured":"Bin Lin, Ningxin Zheng, Lei Wang, Shijie Cao, Lingxiao Ma, Quanlu Zhang, Yi Zhu, Ting Cao, Jilong Xue, Yuqing Yang, et al. 2023. Efficient GPU Kernels for N: M-Sparse Weights in Deep Learning. Proceedings of Machine Learning and Systems (2023)."},{"key":"e_1_3_2_1_45_1","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Lin Zhiqi","year":"2024","unstructured":"Zhiqi Lin, Youshan Miao, Quanlu Zhang, Fan Yang, Yi Zhu, Cheng Li, Saeed Maleki, Xu Cao, Ning Shang, Yilei Yang, et al. 2024. {nnScaler}:{Constraint-Guided} Parallelization Plan Generation for Deep Learning Training. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 347--363."},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3627703.3629554"},{"key":"e_1_3_2_1_47_1","volume-title":"International Conference on Machine Learning. PMLR, 14139--14152","author":"Liu Xiaoxuan","year":"2022","unstructured":"Xiaoxuan Liu, Lianmin Zheng, Dequan Wang, Yukuo Cen, Weize Chen, Xu Han, Jianfei Chen, Zhiyuan Liu, Jie Tang, Joey Gonzalez, et al. 2022. Gact: Activation compressed training for generic network architectures. In International Conference on Machine Learning. PMLR, 14139--14152."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3587135.3592200"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2017.29"},{"key":"e_1_3_2_1_50_1","unstructured":"meta llama. 2024. llama3.1. https:\/\/ai.meta.com\/blog\/meta-llama-3-1\/."},{"key":"e_1_3_2_1_51_1","unstructured":"meta llama\/llama3. 2024. llama\/MODEL_CARD.md. 
https:\/\/github.com\/meta-llama\/models\/llama3_1\/blob\/main\/MODEL_CARD.md."},{"key":"e_1_3_2_1_52_1","unstructured":"meta llama\/llama3. 2024. llama\/MODEL_CARD.md. https:\/\/github.com\/meta-llama\/models\/llama3\/blob\/main\/MODEL_CARD.md."},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.14778\/3570690.3570697"},{"key":"e_1_3_2_1_54_1","volume-title":"Mixed Precision Training. In 6th International Conference on Learning Representations, ICLR.","author":"Micikevicius Paulius","year":"2018","unstructured":"Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, Erich Elsen, David Garc\u00eda, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed Precision Training. In 6th International Conference on Learning Representations, ICLR."},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_2_1_56_1","volume-title":"International Conference on Machine Learning. PMLR, 7937--7947","author":"Narayanan Deepak","year":"2021","unstructured":"Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2021. Memory-efficient pipeline-parallel dnn training. In International Conference on Machine Learning. PMLR, 7937--7947."},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_3_2_1_58_1","unstructured":"NVIDIA. [n.d.]. NVIDIA A100 GPUs. https:\/\/www.nvidia.com\/en-us\/data-center\/a100\/."},{"key":"e_1_3_2_1_59_1","unstructured":"NVIDIA. 2023. NVIDIA L4 GPUs. https:\/\/www.nvidia.com\/en-us\/data-center\/l4\/."},{"key":"e_1_3_2_1_60_1","unstructured":"openai. 2022. ChatGPT. https:\/\/openai.com\/chatgpt\/."},{"key":"e_1_3_2_1_61_1","volume-title":"Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_3_2_1_62_1","unstructured":"Houwen Peng Kan Wu Yixuan Wei Guoshuai Zhao Yuxiang Yang Ze Liu Yifan Xiong Ziyue Yang Bolin Ni Jingcheng Hu et al. 2023. Fp8-lm: Training fp8 large language models. arXiv preprint arXiv:2310.18313 (2023)."},{"key":"e_1_3_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378505"},{"key":"e_1_3_2_1_64_1","volume-title":"Zero Bubble Pipeline Parallelism. arXiv preprint arXiv:2401.10241","author":"Qi Penghui","year":"2023","unstructured":"Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2023. Zero Bubble Pipeline Parallelism. arXiv preprint arXiv:2401.10241 (2023)."},{"key":"e_1_3_2_1_65_1","volume-title":"International conference on machine learning. PMLR","author":"Rajbhandari Samyam","year":"2022","unstructured":"Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In International conference on machine learning. 
PMLR, 18332--18346."},{"key":"e_1_3_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_3_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476205"},{"key":"e_1_3_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_3_2_1_69_1","volume-title":"2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Ren Jie","year":"2021","unstructured":"Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. {ZeRO-Offload}: Democratizing {Billion-Scale} model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 551--564."},{"key":"e_1_3_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783721"},{"key":"e_1_3_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1145\/3437984.3458829"},{"key":"e_1_3_2_1_72_1","volume-title":"Glu variants improve transformer. arXiv preprint arXiv:2002.05202","author":"Shazeer Noam","year":"2020","unstructured":"Noam Shazeer. 2020. Glu variants improve transformer. arXiv preprint arXiv:2002.05202 (2020)."},{"key":"e_1_3_2_1_73_1","volume-title":"Megatron-lm: Training multibillion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multibillion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)."},{"key":"e_1_3_2_1_74_1","volume-title":"Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243","author":"Strubell Emma","year":"2019","unstructured":"Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243 (2019)."},{"key":"e_1_3_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2023.127063"},{"key":"e_1_3_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620666.3651359"},{"key":"e_1_3_2_1_77_1","volume-title":"Piper: Multidimensional Planner for DNN Parallelization. In Neural Information Processing Systems. https:\/\/api.semanticscholar.org\/CorpusID:244711821","author":"Tarnawski Jakub","year":"2021","unstructured":"Jakub Tarnawski, Deepak Narayanan, and Amar Phanishayee. 2021. Piper: Multidimensional Planner for DNN Parallelization. In Neural Information Processing Systems. https:\/\/api.semanticscholar.org\/CorpusID:244711821"},{"key":"e_1_3_2_1_78_1","volume-title":"Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, MarieAnne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)."},{"key":"e_1_3_2_1_79_1","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et al. 2023. Llama 2: Open foundation and fine-tuned chat models. 
arXiv preprint arXiv:2307.09288 (2023)."},{"key":"e_1_3_2_1_80_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Unger Colin","year":"2022","unstructured":"Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, et al. 2022. Unity: Accelerating {DNN} training through joint optimization of algebraic transformations and parallelization. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 267--284."},{"key":"e_1_3_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178491"},{"key":"e_1_3_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.1145\/3302424.3303953"},{"key":"e_1_3_2_1_83_1","volume-title":"Proceedings of Machine Learning and Systems 5","author":"Wang Zhuang","year":"2023","unstructured":"Zhuang Wang, Xinyu Wu, Zhaozhuo Xu, and TS Ng. 2023. Cupcake: A Compression Scheduler for Scalable Communication-Efficient Distributed Training. Proceedings of Machine Learning and Systems 5 (2023)."},{"key":"e_1_3_2_1_84_1","volume-title":"2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Yu Geoffrey X","year":"2021","unstructured":"Geoffrey X Yu, Yubo Gao, Pavel Golikov, and Gennady Pekhimenko. 2021. Habitat: A {Runtime-Based} computational performance predictor for deep neural network training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21)."},{"key":"e_1_3_2_1_85_1","volume-title":"2024 USENIX Annual Technical Conference (USENIX ATC 24)","author":"Yuan Tailing","year":"2024","unstructured":"Tailing Yuan, Yuliang Liu, Xucheng Ye, Shenglong Zhang, Jianchao Tan, Bin Chen, Chengru Song, and Di Zhang. 2024. Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism. In 2024 USENIX Annual Technical Conference (USENIX ATC 24). USENIX Association, Santa Clara, CA, 545--561. https:\/\/www.usenix.org\/conference\/atc24\/presentation\/yuan"},{"key":"e_1_3_2_1_86_1","volume-title":"Root mean square layer normalization. Advances in Neural Information Processing Systems 32","author":"Zhang Biao","year":"2019","unstructured":"Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. Advances in Neural Information Processing Systems 32 (2019)."},{"key":"e_1_3_2_1_87_1","doi-asserted-by":"crossref","unstructured":"Yanli Zhao Andrew Gu Rohan Varma Liang Luo Chien-Chin Huang Min Xu Less Wright Hamid Shojanazeri Myle Ott Sam Shleifer et al. 2023. Pytorch FSDP: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277 (2023).","DOI":"10.14778\/3611540.3611569"},{"key":"e_1_3_2_1_88_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00092"},{"key":"e_1_3_2_1_89_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter-and {Intra-Operator} parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 
559--578."}],"event":{"name":"EuroSys '25: Twentieth European Conference on Computer Systems","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems"],"location":"Rotterdam Netherlands","acronym":"EuroSys '25"},"container-title":["Proceedings of the Twentieth European Conference on Computer Systems"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3689031.3717461","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3689031.3717461","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T11:20:31Z","timestamp":1755775231000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3689031.3717461"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,30]]},"references-count":89,"alternative-id":["10.1145\/3689031.3717461","10.1145\/3689031"],"URL":"https:\/\/doi.org\/10.1145\/3689031.3717461","relation":{},"subject":[],"published":{"date-parts":[[2025,3,30]]},"assertion":[{"value":"2025-03-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}