{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T10:20:04Z","timestamp":1775038804588,"version":"3.50.1"},"reference-count":45,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2026,3,16]],"date-time":"2026-03-16T00:00:00Z","timestamp":1773619200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.springernature.com\/gp\/researchers\/text-and-data-mining"},{"start":{"date-parts":[[2026,3,16]],"date-time":"2026-03-16T00:00:00Z","timestamp":1773619200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.springernature.com\/gp\/researchers\/text-and-data-mining"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62025208"],"award-info":[{"award-number":["62025208"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62421002"],"award-info":[{"award-number":["62421002"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["CCF Trans. HPC"],"published-print":{"date-parts":[[2026,4]]},"DOI":"10.1007\/s42514-025-00271-w","type":"journal-article","created":{"date-parts":[[2026,3,16]],"date-time":"2026-03-16T10:27:48Z","timestamp":1773656868000},"page":"221-236","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Parallelsim: an accurate, generic, and efficient simulator for distributed deep learning"],"prefix":"10.1007","volume":"8","author":[{"given":"Peng","family":"Liang","sequence":"first","affiliation":[]},{"given":"Linbo","family":"Qiao","sequence":"additional","affiliation":[]},{"given":"Zhiquan","family":"Lai","sequence":"additional","affiliation":[]},{"given":"Dongsheng","family":"Li","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2026,3,16]]},"reference":[{"key":"271_CR1","doi-asserted-by":"publisher","unstructured":"Ansel, J, Yang, E, He, H, Gimelshein, N, Jain, A, Voznesensky, M., Bao, B, Bell, P, Berard, D, Burovski, E., Chauhan, G, Chourdia, A, Constable, W., Desmaison, A, DeVito, Z, Ellison, E, Feng, W., Gong, J, Gschwind, M, Hirsh, B., Huang, S, Kalambarkar, K, Kirsch, L, Lazos, M, Lezcano, M., Liang, Y, Liang, J, Lu, Y, Luk, C.K, Maher, B., Pan, Y, Puhrsch, C, Reso, M, Saroufim, M, Siraichi, M.Y, Suk, H, Zhang, S, Suo, M, Tillet, P, Zhao, X, Wang, E, Zhou, K, Zou, R, Wang, X, Mathews, A, Wen, W, Chanan, G, Wu, P, Chintala, S: Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. ASPLOS \u201924, pp. 929\u2013947. Association for Computing Machinery, New York, NY, USA (2024). https:\/\/doi.org\/10.1145\/3620665.3640366","DOI":"10.1145\/3620665.3640366"},{"key":"271_CR2","doi-asserted-by":"publisher","unstructured":"Brown, T.B, Mann, B, Ryder, N, Subbiah, M, Kaplan, J, Dhariwal, P, Neelakantan, A, Shyam, P, Sastry, G, Askell, A, Agarwal, S, Herbert-Voss, A, Krueger, G, Henighan, T, Child, R, Ramesh, A, Ziegler, D.M, Wu, J, Winter, C, Hesse, C, Chen, M, Sigler, E, Litwin, M, Gray, S, Chess, B, Clark, J, Berner, C, McCandlish, S, Radford, A, Sutskever, I, Amodei, D: Language models are few-shot learners. In: Larochelle, H, Ranzato, M, Hadsell, R, Balcan, M, Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, Virtual (2020). https:\/\/doi.org\/10.48550\/arXiv.2005.14165","DOI":"10.48550\/arXiv.2005.14165"},{"key":"271_CR3","unstructured":"Chen, Z, Deng, Y, Wu, Y, Gu, Q, Li, Y: Towards understanding the mixture-of-experts layer in deep learning. In: Koyejo, S, Mohamed, S, Agarwal, A, Belgrave, D, Cho, K, Oh, A. (eds.) Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022 (2022). http:\/\/papers.nips.cc\/paper_files\/paper\/2022\/hash\/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html"},{"key":"271_CR4","doi-asserted-by":"crossref","unstructured":"Dao, T, Fu, D.Y, Ermon, S, Rudra, A, R\u00e9, C: FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)","DOI":"10.52202\/068431-1189"},{"key":"271_CR5","doi-asserted-by":"publisher","unstructured":"DeepSeek-AI, Liu, A, Feng, B, Wang, B, Wang, B, Liu, B, Zhao, C, Deng, C, Ruan, C, Dai, D, Guo, D, Yang, D, Chen, D, Ji, D, Li, E, Lin, F, Luo, F, Hao, G, Chen, G, Li, G, Zhang, H, Xu, H, Yang, H, Zhang, H, Ding, H, Xin, H, Gao, H, Li, H, Qu, H, Cai, J.L, Liang, J, Guo, J, Ni, J, Li, J, Chen, J, Yuan, J, Qiu, J, Song, J, Dong, K, Gao, K, Guan, K, Wang, L, Zhang, L, Xu, L, Xia, L, Zhao, L, Zhang, L, Li, M, Wang, M, Zhang, M, Zhang, M, Tang, M, Li, M, Tian, N, Huang, P, Wang, P, Zhang, P, Zhu, Q, Chen, Q, Du, Q., Chen, R.J, Jin, R.L, Ge, R, Pan, R, Xu, R, Chen, R, Li, S.S, Lu, S, Zhou, S, Chen, S, Wu, S, Ye, S, Ma, S, Wang, S, Zhou, S, Yu, S, Zhou, S, Zheng, S, Wang, T, Pei, T, Yuan, T, Sun, T, Xiao, W.L, Zeng, W, An, W, Liu, W, Liang, W, Gao, W, Zhang, W, Li, X.Q, Jin, X, Wang, X, Bi, X, Liu, X, Wang, X, Shen, X, Chen, X, Chen, X, Nie, X, Sun, X.: Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. CoRR abs\/2405.04434 (2024) https:\/\/doi.org\/10.48550\/ARXIV.2405.04434arXiv:2405.04434","DOI":"10.48550\/ARXIV.2405.04434"},{"key":"271_CR6","doi-asserted-by":"publisher","unstructured":"Devlin, J, Chang, M, Lee, K, Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4171\u20134186 (2019). https:\/\/doi.org\/10.18653\/V1\/N19-1423","DOI":"10.18653\/V1\/N19-1423"},{"issue":"10","key":"271_CR7","doi-asserted-by":"publisher","first-page":"1867","DOI":"10.1109\/TPDS.2024.3443255","volume":"35","author":"J Duan","year":"2024","unstructured":"Duan, J., Li, X., Xu, P., Zhang, X., Yan, S., Liang, Y., Lin, D.: Proteus: Simulating the performance of distributed dnn training. IEEE Trans. Parallel Distrib. Syst. 35(10), 1867\u20131878 (2024). https:\/\/doi.org\/10.1109\/TPDS.2024.3443255","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"271_CR8","doi-asserted-by":"publisher","unstructured":"Dubey, A, Jauhri, A, Pandey, A, Kadian, A, Al-Dahle, A, Letman, A, Mathur, A, Schelten, A, Yang, A, Fan, A, Goyal, A, Hartshorn, A, Yang, A, Mitra, A, Sravankumar, A, Korenev, A, Hinsvark, A, Rao, A, Zhang, A., Rodriguez, A, Gregerson, A, Spataru, A, Rozi\u00e8re, B, Biron, B, Tang, B, Chern, B, Caucheteux, C, Nayak, C, Bi, C, Marra, C, McConnell, C, Keller, C, Touret, C, Wu, C, Wong, C, Ferrer, C.C, Nikolaidis, C, Allonsius, D, Song, D, Pintz, D, Livshits, D, Esiobu, D, Choudhary, D, Mahajan, D, Garcia-Olano, D, Perino, D, Hupkes, D, Lakomkin, E, AlBadawy, E, Lobanova, E, Dinan, E., Smith, E.M., Radenovic, F, Zhang, F, Synnaeve, G, Lee, G, Anderson, G.L, Nail, G, Mialon, G, Pang, G, Cucurell, G, Nguyen, H, Korevaar, H, Xu, H, Touvron, H, Zarov, I, Ibarra, I.A, Kloumann, I.M, Misra, I, Evtimov, I, Copet, J, Lee, J, Geffert, J, Vranes, J, Park, J, Mahadeokar, J, Shah, J, Linde, J, Billock, J, Hong, J, Lee, J, Fu, J, Chi, J, Huang, J, Liu, J, Wang, J, Yu, J, Bitton, J, Spisak, J, Park, J, Rocca, J, Johnstun, J, Saxe, J, Jia, J, Alwala, K.V, Upasani, K, Plawiak, K, Li, K, Heafield, K, Stone, K, al.: The llama 3 herd of models. Arxiv abs\/2407.21783 (2024) https:\/\/doi.org\/10.48550\/ARXIV.2407.21783","DOI":"10.48550\/ARXIV.2407.21783"},{"key":"271_CR9","doi-asserted-by":"publisher","unstructured":"Fan, S, Rong, Y, Meng, C, Cao, Z, Wang, S, Zheng, Z, Wu, C, Long, G, Yang, J, Xia, L, Diao, L, Liu, X, Lin, W.: DAPPLE: a pipelined data parallel approach for training large models. In: Lee, J., Petrank, E. (eds.) PPoPP \u201921: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Virtual Event, Republic of Korea, February 27- March 3, 2021, pp. 431\u2013445 (2021). https:\/\/doi.org\/10.1145\/3437801.3441593","DOI":"10.1145\/3437801.3441593"},{"issue":"120","key":"271_CR10","first-page":"1","volume":"23","author":"W Fedus","year":"2022","unstructured":"Fedus, W., Zoph, B., Shazeer, N.: Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 23(120), 1\u201339 (2022)","journal-title":"J. Mach. Learn. Res."},{"key":"271_CR11","doi-asserted-by":"publisher","unstructured":"He, J, Qiu, J, Zeng, A, Yang, Z, Zhai, J, Tang, J.: Fastmoe: A fast mixture-of-expert training system. CoRR abs\/2103.13262 (2021) https:\/\/doi.org\/10.48550\/ARXIV.2103.13262","DOI":"10.48550\/ARXIV.2103.13262"},{"key":"271_CR12","unstructured":"He, H.: Strangely, Matrix Multiplications on GPUs Run Faster When Given \"Predictable\" Data. [Online]. Available: https:\/\/www.thonking.ai\/p\/strangely-matrix-multiplications (2024)"},{"key":"271_CR13","unstructured":"Huang, Y, Cheng, Y, Bapna, A, Firat, O, Chen, D, Chen, M.X, Lee, H, Ngiam, J, Le, Q.V, Wu, Y, Chen, Z.: Gpipe: Efficient training of giant neural networks using pipeline parallelism. In: Wallach, H.M., Larochelle, H., Beygelzimer, A, d\u2019Alch\u00e9-Buc, F, Fox, E.B, Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 103\u2013112 (2019)"},{"key":"271_CR14","unstructured":"Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Bach, F.R., Blei, D.M. (eds.) Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015. JMLR Workshop and Conference Proceedings, vol. 37, pp. 448\u2013456. JMLR.org, ??? (2015)"},{"key":"271_CR15","doi-asserted-by":"publisher","unstructured":"Jacobs, S.A, Tanaka, M, Zhang, C, Zhang, M, Aminadabi, R.Y, Song, S.L, Rajbhandari, S, He, Y: System optimizations for enabling training of extreme long sequence transformer models. In: Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing. PODC \u201924, pp. 121\u2013130. Association for Computing Machinery, New York, NY, USA (2024). https:\/\/doi.org\/10.1145\/3662158.3662806","DOI":"10.1145\/3662158.3662806"},{"key":"271_CR16","unstructured":"Korthikanti, V.A, Casper, J, Lym, S, McAfee, L, Andersch, M, Shoeybi, M, Catanzaro, B: Reducing activation recomputation in large transformer models (2023)"},{"key":"271_CR17","doi-asserted-by":"publisher","unstructured":"Lai, Z, Li, S, et al.: Merak: An efficient distributed dnn training framework with automated 3d parallelism for giant foundation models. IEEE Trans. Parallel Distrib. Syst., 1\u201313 (2023) https:\/\/doi.org\/10.1109\/TPDS.2023.3247001","DOI":"10.1109\/TPDS.2023.3247001"},{"issue":"12","key":"271_CR18","doi-asserted-by":"publisher","first-page":"2802","DOI":"10.1109\/TPDS.2020.3003307","volume":"31","author":"M Langer","year":"2020","unstructured":"Langer, M., He, Z., Rahayu, W., Xue, Y.: Distributed training of deep learning models: A taxonomic perspective. IEEE Trans. Parallel Distrib. Syst. 31(12), 2802\u20132818 (2020). https:\/\/doi.org\/10.1109\/TPDS.2020.3003307","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"issue":"7553","key":"271_CR19","doi-asserted-by":"publisher","first-page":"436","DOI":"10.1038\/NATURE14539","volume":"521","author":"Y LeCun","year":"2015","unstructured":"LeCun, Y., Bengio, Y., Hinton, G.E.: Deep learning. Nat. 521(7553), 436\u2013444 (2015). https:\/\/doi.org\/10.1038\/NATURE14539","journal-title":"Deep learning. Nat."},{"key":"271_CR20","doi-asserted-by":"publisher","unstructured":"Li, S, Hoefler, T: Chimera: efficiently training large-scale neural networks with bidirectional pipelines. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC \u201921. Association for Computing Machinery, New York, NY, USA (2021). https:\/\/doi.org\/10.1145\/3458817.3476145","DOI":"10.1145\/3458817.3476145"},{"key":"271_CR21","doi-asserted-by":"publisher","unstructured":"Li, S, Zhao, Y, Varma, R, Salpekar, O, Noordhuis, P, Li, T, Paszke, A, Smith, J, Vaughan, B, Damania, P, Chintala, S: Pytorch distributed: experiences on accelerating data parallel training. Proc. VLDB Endow. 13(12), 3005\u20133018 (2020) https:\/\/doi.org\/10.14778\/3415478.3415530","DOI":"10.14778\/3415478.3415530"},{"issue":"1","key":"271_CR22","doi-asserted-by":"publisher","first-page":"94","DOI":"10.1109\/TPDS.2019.2928289","volume":"31","author":"A Li","year":"2020","unstructured":"Li, A., Song, S.L., Chen, J., Li, J., Liu, X., Tallent, N.R., Barker, K.J.: Evaluating modern gpu interconnect: Pcie, nvlink, nv-sli, nvswitch and gpudirect. IEEE Trans. Parallel Distrib. Syst. 31(1), 94\u2013110 (2020). https:\/\/doi.org\/10.1109\/TPDS.2019.2928289","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"issue":"08","key":"271_CR23","doi-asserted-by":"publisher","first-page":"2377","DOI":"10.1109\/TPDS.2023.3281931","volume":"34","author":"P Liang","year":"2023","unstructured":"Liang, P., Tang, Y., Zhang, X., Bai, Y., Su, T., Lai, Z., Qiao, L., Li, D.: A survey on auto-parallelism of large-scale deep learning training. IEEE Trans. Parallel Distrib. Syst. 34(08), 2377\u20132390 (2023). https:\/\/doi.org\/10.1109\/TPDS.2023.3281931","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"271_CR24","doi-asserted-by":"publisher","unstructured":"Liu, Z, Cheng, S, Zhou, H, You, Y: Hanayo: Harnessing wave-like pipeline parallelism for enhanced large model training efficiency. The International Conference for High Performance Computing, Networking, Storage, and Analysis, 1\u201313 (2023) https:\/\/doi.org\/10.1145\/3581784.3607073","DOI":"10.1145\/3581784.3607073"},{"key":"271_CR25","doi-asserted-by":"publisher","unstructured":"Liu, H, Zaharia, M, Abbeel, P: Ring attention with blockwise transformers for near-infinite context. ArXiv abs\/2310.01889 (2023) https:\/\/doi.org\/10.48550\/arXiv.2310.01889","DOI":"10.48550\/arXiv.2310.01889"},{"key":"271_CR26","doi-asserted-by":"publisher","unstructured":"Liu, H, Zaharia, M, Abbeel, P: Ring attention with blockwise transformers for near-infinite context. CoRR abs\/2310.01889 (2023) https:\/\/doi.org\/10.48550\/ARXIV.2310.01889","DOI":"10.48550\/ARXIV.2310.01889"},{"key":"271_CR27","doi-asserted-by":"publisher","unstructured":"Lu, W, Yan, G, Li, J, Gong, S, Han, Y, Li, X: Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks. In: 2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA, February 4-8, 2017, pp. 553\u2013564. IEEE Computer Society, ??? (2017). https:\/\/doi.org\/10.1109\/HPCA.2017.29","DOI":"10.1109\/HPCA.2017.29"},{"issue":"1","key":"271_CR28","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1145\/3363554","volume":"53","author":"R Mayer","year":"2021","unstructured":"Mayer, R., Jacobsen, H.: Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools. ACM Comput. Surv. 53(1), 3\u20131337 (2021). https:\/\/doi.org\/10.1145\/3363554","journal-title":"ACM Comput. Surv."},{"key":"271_CR29","doi-asserted-by":"publisher","unstructured":"Narayanan, D, Phanishayee, A, Shi, K, Chen, X, Zaharia, M: Memory-efficient pipeline-parallel dnn training. In: International Conference on Machine Learning (ICML 2021) (2021). https:\/\/doi.org\/10.48550\/arXiv.2006.09503","DOI":"10.48550\/arXiv.2006.09503"},{"key":"271_CR30","doi-asserted-by":"publisher","unstructured":"Narayanan, D, Shoeybi, M, Casper, J, LeGresley, P, Patwary, M, Korthikanti, V, Vainbrand, D, Kashinkunti, P, Bernauer, J, Catanzaro, B, Phanishayee, A, Zaharia, M.: Efficient large-scale language model training on GPU clusters using megatron-lm. In: Supinski, B.R, Hall, M.W, Gamblin, T. (eds.) International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, p. 58 (2021). https:\/\/doi.org\/10.1145\/3458817.3476209","DOI":"10.1145\/3458817.3476209"},{"key":"271_CR31","doi-asserted-by":"publisher","unstructured":"OpenAI, Achiam, J, Adler, S, et al.: Gpt-4 technical report. ArXiv abs\/2303.08774 (2023) https:\/\/doi.org\/10.48550\/arXiv.2303.08774","DOI":"10.48550\/arXiv.2303.08774"},{"key":"271_CR32","unstructured":"Paszke, A, et al.: Pytorch: An imperative style, high-performance deep learning library. In: Proc. Adv. Neural Inf. Process. Syst., pp. 8024\u20138035 (2019)"},{"issue":"2","key":"271_CR33","doi-asserted-by":"publisher","first-page":"117","DOI":"10.1016\/J.JPDC.2008.09.002","volume":"69","author":"P Patarasuk","year":"2009","unstructured":"Patarasuk, P., Yuan, X.: Bandwidth optimal all-reduce algorithms for clusters of workstations. J. Parallel Distributed Comput. 69(2), 117\u2013124 (2009). https:\/\/doi.org\/10.1016\/J.JPDC.2008.09.002","journal-title":"J. Parallel Distributed Comput."},{"key":"271_CR34","unstructured":"Qi, H., Sparks, E.R., Talwalkar, A.: Paleo: A performance model for deep neural networks. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings (2017). https:\/\/openreview.net\/forum?id=SyVVJ85lg"},{"key":"271_CR35","unstructured":"Qi, P., Wan, X., Huang, G., Lin, M.: Zero bubble (almost) pipeline parallelism. In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 (2024). https:\/\/openreview.net\/forum?id=tuzTN0eIO5"},{"key":"271_CR36","doi-asserted-by":"publisher","unstructured":"Rajbhandari, S., Rasley, J., Ruwase, O., He, Y.: Zero: memory optimizations toward training trillion parameter models. In: Cuicchi, C., Qualters, I., Kramer, W.T. (eds.) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event \/ Atlanta, Georgia, USA, November 9-19, 2020, p. 20. IEEE\/ACM, ??? (2020). https:\/\/doi.org\/10.1109\/SC41405.2020.00024","DOI":"10.1109\/SC41405.2020.00024"},{"key":"271_CR37","doi-asserted-by":"publisher","unstructured":"Santhanam, K., Krishna, S., Tomioka, R., Fitzgibbon, A.W., Harris, T.: Distir: An intermediate representation for optimizing distributed neural networks. In: Yoneki, E., Patras, P. (eds.) EuroMLSys@EuroSys 2021, Proceedings of the 1st Workshop on Machine Learning and Systemsg Virtual Event, Edinburgh, Scotland, UK, 26 April, 2021, pp. 15\u201323. ACM, ??? (2021). https:\/\/doi.org\/10.1145\/3437984.3458829","DOI":"10.1145\/3437984.3458829"},{"key":"271_CR38","doi-asserted-by":"publisher","unstructured":"Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-lm: Training multi-billion parameter language models using model parallelism. ArXiv abs\/1909.08053 (2019) https:\/\/doi.org\/10.48550\/arXiv.1909.08053","DOI":"10.48550\/arXiv.1909.08053"},{"key":"271_CR39","doi-asserted-by":"publisher","unstructured":"Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi\u00e8re, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., Lample, G.: Llama: Open and efficient foundation language models. ArXiv abs\/2302.13971 (2023) https:\/\/doi.org\/10.48550\/arXiv.2302.13971","DOI":"10.48550\/arXiv.2302.13971"},{"key":"271_CR40","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)"},{"key":"271_CR41","doi-asserted-by":"publisher","unstructured":"Won, W., Heo, T., Rashidi, S., Sridharan, S., Srinivasan, S., Krishna, T.: Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale. In: 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 283\u2013294. IEEE Computer Society, Los Alamitos, CA, USA (2023). https:\/\/doi.org\/10.1109\/ISPASS57527.2023.00035","DOI":"10.1109\/ISPASS57527.2023.00035"},{"key":"271_CR42","unstructured":"Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., Liu, T.: On layer normalization in the transformer architecture. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event. Proceedings of Machine Learning Research, vol. 119, pp. 10524\u201310533 (2020). http:\/\/proceedings.mlr.press\/v119\/xiong20b.html"},{"key":"271_CR43","doi-asserted-by":"publisher","unstructured":"Yan, F., Ruwase, O., He, Y., Chilimbi, T.: Performance modeling and scalability optimization of distributed deep learning systems. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD \u201915, pp. 1355\u20131364. Association for Computing Machinery, New York, NY, USA (2015). https:\/\/doi.org\/10.1145\/2783258.2783270","DOI":"10.1145\/2783258.2783270"},{"key":"271_CR44","unstructured":"Zhang, B., Sennrich, R.: Root mean square layer normalization. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d\u2019Alch\u00e9-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 12360\u201312371 (2019)"},{"key":"271_CR45","doi-asserted-by":"publisher","unstructured":"Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.-C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., Desmaison, A., Balioglu, C., Damania, P., Nguyen, B., Chauhan, G., Hao, Y., Mathews, A., Li, S.: Pytorch fsdp: Experiences on scaling fully sharded data parallel. Proc. VLDB Endow. 16(12), 3848\u20133860 (2023) https:\/\/doi.org\/10.14778\/3611540.3611569","DOI":"10.14778\/3611540.3611569"}],"container-title":["CCF Transactions on High Performance Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42514-025-00271-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s42514-025-00271-w","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42514-025-00271-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T08:08:38Z","timestamp":1775030918000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s42514-025-00271-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,16]]},"references-count":45,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,4]]}},"alternative-id":["271"],"URL":"https:\/\/doi.org\/10.1007\/s42514-025-00271-w","relation":{},"ISSN":["2524-4922","2524-4930"],"issn-type":[{"value":"2524-4922","type":"print"},{"value":"2524-4930","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,16]]},"assertion":[{"value":"13 January 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 December 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 March 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"On behalf of all authors, the corresponding author states that there is no Conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}