{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,3]],"date-time":"2025-12-03T20:18:05Z","timestamp":1764793085850,"version":"3.44.0"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,2,16]],"date-time":"2024-02-16T00:00:00Z","timestamp":1708041600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Meas. Anal. Comput. Syst."],"published-print":{"date-parts":[[2024,2,16]]},"abstract":"<jats:p>Large transformer models have recently achieved great success across various domains. With a growing number of model parameters, a large transformer model training today typically involves model sharding, data parallelism, and model parallelism. Thus, the throughput of large-scale model training depends heavily on the network bandwidth since a combination of model sharding and multiple parallelism strategies incurs various costs. However, prior characterizations of transformer models on high-bandwidth DGX machines that use TFLOPS as a metric may not reflect the performance of a system with lower bandwidth. Furthermore, data and model parallelism reveal significantly distinct training profiles on different system bandwidths at scale and, thus, need a thorough study. In this paper, we provide a bottom-up breakdown of training throughput into compute and communication time, and quantitatively analyze their respective influences on overall end-to-end training scaling. Our evaluation involves an in-depth exploration of data parallelism, scaling up to 512 GPUs with limited bandwidth, and examines three model sharding strategies among six model sizes. We also evaluate three combinations of model parallelism on both high and low bandwidth supercomputing systems. 
Overall, our work provides a broader perspective on large-scale transformer model training, and our analysis and evaluation yield practical insights for predicting training scaling, shaping the future development of supercomputing system design.<\/jats:p>","DOI":"10.1145\/3639034","type":"journal-article","created":{"date-parts":[[2024,2,21]],"date-time":"2024-02-21T17:01:32Z","timestamp":1708534892000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Thorough Characterization and Analysis of Large Transformer Model Training At-Scale"],"prefix":"10.1145","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9954-7986","authenticated-orcid":false,"given":"Scott","family":"Cheng","sequence":"first","affiliation":[{"name":"Pennsylvania State University, State College, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5105-7574","authenticated-orcid":false,"given":"Jun-Liang","family":"Lin","sequence":"additional","affiliation":[{"name":"Pennsylvania State University, State College, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6279-0007","authenticated-orcid":false,"given":"Murali","family":"Emani","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory, Lemont, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4832-0834","authenticated-orcid":false,"given":"Siddhisanket","family":"Raskar","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory, Lemont, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9981-0876","authenticated-orcid":false,"given":"Sam","family":"Foreman","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory, Lemont, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3516-2192","authenticated-orcid":false,"given":"Zhen","family":"Xie","sequence":"additional","affiliation":[{"name":"Binghamton University, Vestal, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7248-6116","authenticated-orcid":false,"given":"Venkatram","family":"Vishwanath","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory, Lemont, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9940-9951","authenticated-orcid":false,"given":"Mahmut Taylan","family":"Kandemir","sequence":"additional","affiliation":[{"name":"Pennsylvania State University, State College, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,2,21]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Lin (Eds.)","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877--1901."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3437801.3441620"},{"key":"e_1_2_1_3_1","unstructured":"Microsoft Corporation. 2022. Megatron-DeepSpeed. \"https:\/\/github.com\/microsoft\/Megatron-DeepSpeed\"."},{"key":"e_1_2_1_4_1","unstructured":"Nvidia Corporation. 2016a. NCCL Tests. 
\"https:\/\/github.com\/NVIDIA\/nccl-tests\"."},{"key":"e_1_2_1_5_1","unstructured":"Nvidia Corporation. 2016b. NVIDIA Collective Communications Library (NCCL). \"https:\/\/github.com\/NVIDIA\/nccl\"."},{"key":"e_1_2_1_6_1","unstructured":"Nvidia Corporation. 2016c. NVIDIA Nsight Systems. \"https:\/\/developer.nvidia.com\/nsight-systems\"."},{"key":"e_1_2_1_7_1","unstructured":"Nvidia Corporation. 2016 d. NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP). \"https:\/\/docs.nvidia.com\/networking\/display\/sharpv300\"."},{"volume-title":"2016 e","author":"Nvidia Corporation","key":"e_1_2_1_8_1","unstructured":"Nvidia Corporation. 2016 e. NVIDIA Tools Extension Library (NVTX). \"https:\/\/github.com\/NVIDIA\/NVTX\"."},{"key":"e_1_2_1_9_1","unstructured":"Nvidia Corporation. 2023. CUDA Samples. \"https:\/\/github.com\/NVIDIA\/cuda-samples\"."},{"key":"e_1_2_1_10_1","volume-title":"Gc3: An optimizing compiler for gpu collective communication. arXiv preprint arXiv:2201.11840","author":"Cowan Meghan","year":"2022","unstructured":"Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi, and Yifan Xiong. 2022. Gc3: An optimizing compiler for gpu collective communication. arXiv preprint arXiv:2201.11840 (2022)."},{"key":"e_1_2_1_11_1","first-page":"16344","article-title":"Flashattention: Fast and memory-efficient exact attention with io-awareness","volume":"35","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems , Vol. 35 (2022), 16344--16359.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_12_1","volume-title":"Gpts are gpts: An early look at the labor market impact potential of large language models. arXiv preprint arXiv:2303.10130","author":"Eloundou Tyna","year":"2023","unstructured":"Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. 2023. Gpts are gpts: An early look at the labor market impact potential of large language models. arXiv preprint arXiv:2303.10130 (2023)."},{"key":"e_1_2_1_13_1","unstructured":"GitHub. 2023. Copilot. \"https:\/\/github.com\/features\/copilot\"."},{"key":"e_1_2_1_14_1","unstructured":"Hannibal046. 2023. Awesome-LLM. \"https:\/\/github.com\/Hannibal046\/Awesome-LLM\"."},{"key":"e_1_2_1_15_1","first-page":"30016","article-title":"An empirical analysis of compute-optimal large language model training","volume":"35","author":"Hoffmann Jordan","year":"2022","unstructured":"Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. An empirical analysis of compute-optimal large language model training. Advances in Neural Information Processing Systems , Vol. 35 (2022), 30016--30030.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_16_1","volume-title":"Architecture and System Support for Transformer Models (ASSYST@ ISCA","author":"Isaev Mikhail","year":"2023","unstructured":"Mikhail Isaev, Nic McDonald, and Richard Vuduc. 2023. Scaling Infrastructure to Support Multi-Trillion Parameter LLM Training. 
In Architecture and System Support for Transformer Models (ASSYST@ISCA 2023)."},{"key":"e_1_2_1_17_1","first-page":"711","article-title":"Data movement is all you need: A case study on optimizing transformers","volume":"3","author":"Ivanov Andrei","year":"2021","unstructured":"Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. 2021. Data movement is all you need: A case study on optimizing transformers. Proceedings of Machine Learning and Systems , Vol. 3 (2021), 711--732.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503222.3507778"},{"key":"e_1_2_1_19_1","volume-title":"Anna Potapenko, et al.","author":"Jumper John","year":"2021","unstructured":"John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin \u017d\u00eddek, Anna Potapenko, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature, Vol. 596, 7873 (2021), 583--589."},{"key":"e_1_2_1_20_1","volume-title":"Scaling laws for neural language models. arXiv preprint arXiv:2001.08361","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)."},{"key":"e_1_2_1_21_1","first-page":"2","article-title":"Bert: Pre-training of deep bidirectional transformers for language understanding","volume":"1","author":"Ming-Wei Chang Jacob Devlin","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, Vol. 1. 2.","journal-title":"Proceedings of NAACL-HLT"},{"key":"e_1_2_1_22_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of Machine Learning and Systems","volume":"5","author":"Korthikanti Vijay Anand","year":"2023","unstructured":"Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2023. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems , Vol. 5 (2023)."},{"key":"e_1_2_1_24_1","unstructured":"Percy Liang Rishi Bommasani Tony Lee Dimitris Tsipras Dilara Soylu Michihiro Yasunaga Yian Zhang Deepak Narayanan Yuhuai Wu Ananya Kumar et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022)."},{"key":"e_1_2_1_25_1","unstructured":"Meta. 2022. PyTorch Profiler. \"https:\/\/pytorch.org\/docs\/stable\/profiler.html\"."},{"key":"e_1_2_1_26_1","doi-asserted-by":"crossref","unstructured":"Hans Meuer Erich Strohmaier Jack Dongarra and Horst Simon. 2001. Top500 supercomputer sites. \"https:\/\/www.top500.org\/\".","DOI":"10.2172\/843058"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_2_1_28_1","unstructured":"OpenAI. 2023a. ChatGPT. \"https:\/\/chat.openai.com\"."},{"key":"e_1_2_1_29_1","unstructured":"OpenAI. 2023b. GPT-4 Technical Report. 
arXiv:2303.08774 [cs.CL]"},{"volume-title":"PyTorch: An Imperative Style","author":"Paszke Adam","key":"e_1_2_1_30_1","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch\u00e9 Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024--8035."},{"key":"e_1_2_1_31_1","volume-title":"Demystifying bert: Implications for accelerator design. arXiv preprint arXiv:2104.08335","author":"Pati Suchita","year":"2021","unstructured":"Suchita Pati, Shaizeen Aga, Nuwan Jayasena, and Matthew D Sinclair. 2021. Demystifying bert: Implications for accelerator design. arXiv preprint arXiv:2104.08335 (2021)."},{"key":"e_1_2_1_32_1","unstructured":"Jack W Rae Sebastian Borgeaud Trevor Cai Katie Millican Jordan Hoffmann Francis Song John Aslanides Sarah Henderson Roman Ring Susannah Young et al. 2021. Scaling language models: Methods analysis & insights from training gopher. arXiv preprint arXiv:2112.11446 (2021)."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476205"},{"key":"e_1_2_1_35_1","volume-title":"International Conference on Machine Learning. PMLR, 8821--8831","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In International Conference on Machine Learning. PMLR, 8821--8831."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_2_1_37_1","volume-title":"2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Ren Jie","year":"2021","unstructured":"Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 551--564."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_2_1_39_1","volume-title":"Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575712"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS49936.2021.00109"},{"key":"e_1_2_1_42_1","volume-title":"Attention is all you need. Advances in neural information processing systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems , Vol. 
30 (2017)."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3545008.3545087"},{"key":"e_1_2_1_44_1","volume-title":"Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, and Yuxiong He.","author":"Wang Guanhua","year":"2023","unstructured":"Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, and Yuxiong He. 2023. ZeRO++: Extremely Efficient Collective Communication for Giant Model Training. arXiv preprint arXiv:2306.10209 (2023)."},{"key":"e_1_2_1_45_1","unstructured":"Jason Wei Yi Tay Rishi Bommasani Colin Raffel Barret Zoph Sebastian Borgeaud Dani Yogatama Maarten Bosma Denny Zhou Donald Metzler et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022)."},{"key":"e_1_2_1_46_1","volume-title":"ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats. arXiv preprint arXiv:2307.09782","author":"Wu Xiaoxia","year":"2023","unstructured":"Xiaoxia Wu, Zhewei Yao, and Yuxiong He. 2023. ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats. arXiv preprint arXiv:2307.09782 (2023)."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS54959.2023.00031"},{"key":"e_1_2_1_48_1","volume-title":"ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation. arXiv preprint arXiv:2303.08302","author":"Yao Zhewei","year":"2023","unstructured":"Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, and Yuxiong He. 2023. ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation. arXiv preprint arXiv:2303.08302 (2023)."},{"key":"e_1_2_1_49_1","doi-asserted-by":"crossref","unstructured":"Yanli Zhao Andrew Gu Rohan Varma Liang Luo Chien-Chin Huang Min Xu Less Wright Hamid Shojanazeri Myle Ott Sam Shleifer et al. 2023. PyTorch FSDP: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277 (2023).","DOI":"10.14778\/3611540.3611569"},{"key":"e_1_2_1_50_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559--578."},{"key":"e_1_2_1_51_1","volume-title":"Proceedings of Machine Learning and Systems","volume":"5","author":"Zhuang Yonghao","year":"2023","unstructured":"Yonghao Zhuang, Lianmin Zheng, Zhuohan Li, Eric Xing, Qirong Ho, Joseph Gonzalez, Ion Stoica, Hao Zhang, and Hexu Zhao. 2023. On optimizing the communication of model parallelism. Proceedings of Machine Learning and Systems , Vol. 5 (2023)."},{"key":"e_1_2_1_52_1","volume-title":"Austin Clyde, Bharat Kale, Danilo Perez-Rivera, Heng Ma, et al.","author":"Zvyagin Maxim","year":"2022","unstructured":"Maxim Zvyagin, Alexander Brace, Kyle Hippe, Yuntian Deng, Bin Zhang, Cindy Orozco Bohorquez, Austin Clyde, Bharat Kale, Danilo Perez-Rivera, Heng Ma, et al. 2022. GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. bioRxiv (2022), 2022--10.
"}],"container-title":["Proceedings of the ACM on Measurement and Analysis of Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3639034","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3639034","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,23]],"date-time":"2025-08-23T01:38:13Z","timestamp":1755913093000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3639034"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,16]]},"references-count":52,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,2,16]]}},"alternative-id":["10.1145\/3639034"],"URL":"https:\/\/doi.org\/10.1145\/3639034","relation":{},"ISSN":["2476-1249"],"issn-type":[{"type":"electronic","value":"2476-1249"}],"subject":[],"published":{"date-parts":[[2024,2,16]]},"assertion":[{"value":"2024-02-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
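
The object above is a Crossref work record for DOI 10.1145/3639034. As a minimal sketch of how such a record can be retrieved and read, the Python example below queries the public Crossref REST API (https://api.crossref.org/works/{doi}) and prints a few of the fields shown above; the third-party requests package and the mailto contact address are assumptions for illustration, not part of the record.

# Hypothetical sketch: fetch the Crossref work record shown above and read a few fields.
import requests

DOI = "10.1145/3639034"  # DOI of the record above

resp = requests.get(
    f"https://api.crossref.org/works/{DOI}",
    params={"mailto": "you@example.org"},  # placeholder contact for Crossref's polite pool
    timeout=30,
)
resp.raise_for_status()
work = resp.json()["message"]  # payload mirrors the "message" object in the record above

print(work["title"][0])                      # paper title
print(work["DOI"], work["container-title"][0])
print("references:", work["references-count"])
# Entries in "reference" often carry only an "unstructured" string rather than a resolved DOI.
for ref in work.get("reference", [])[:5]:
    print("-", ref.get("DOI") or ref.get("unstructured", "")[:80])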