{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T12:36:20Z","timestamp":1773318980219,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":63,"publisher":"ACM","funder":[{"DOI":"10.13039\/501100018537","name":"National Science and Technology Major Project","doi-asserted-by":"publisher","award":["2023ZD0120502"],"award-info":[{"award-number":["2023ZD0120502"]}],"id":[{"id":"10.13039\/501100018537","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["Grant No. 62372055"],"award-info":[{"award-number":["Grant No. 62372055"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"publisher","award":[""],"award-info":[{"award-number":[""]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,16]]},"DOI":"10.1145\/3712285.3759783","type":"proceedings-article","created":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T16:05:39Z","timestamp":1762963539000},"page":"1755-1768","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Hypertron: Efficiently Scaling Large Models by Exploring High-Dimensional Parallelization Space"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0022-7865","authenticated-orcid":false,"given":"Shigang","family":"Li","sequence":"first","affiliation":[{"name":"School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-6866-6215","authenticated-orcid":false,"given":"Jingkun","family":"Dong","sequence":"additional","affiliation":[{"name":"School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-8800-6026","authenticated-orcid":false,"given":"Jihao","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-9847-2770","authenticated-orcid":false,"given":"Zhi","family":"Ma","sequence":"additional","affiliation":[{"name":"School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6708-3942","authenticated-orcid":false,"given":"Zhongzhe","family":"Hu","sequence":"additional","affiliation":[{"name":"Huawei, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,11,15]]},"reference":[{"key":"e_1_3_3_2_2_2","doi-asserted-by":"crossref","unstructured":"Ramesh\u00a0C Agarwal Susanne\u00a0M Balle Fred\u00a0G Gustavson Mahesh Joshi and Prasad Palkar. 1995. A three-dimensional approach to parallel matrix multiplication. IBM Journal of Research and Development 39 5 (1995) 575\u2013582.","DOI":"10.1147\/rd.395.0575"},{"key":"e_1_3_3_2_3_2","doi-asserted-by":"crossref","unstructured":"Alok Aggarwal Ashok\u00a0K Chandra and Marc Snir. 1990. Communication complexity of PRAMs. 
Theoretical Computer Science 71 1 (1990) 3\u201328.","DOI":"10.1016\/0304-3975(90)90188-N"},{"key":"e_1_3_3_2_4_2","unstructured":"AI21. 2024. Introducing Jamba: AI21\u2019s Groundbreaking SSM-Transformer Model. https:\/\/www.ai21.com\/blog\/announcing-jamba."},{"key":"e_1_3_3_2_5_2","doi-asserted-by":"publisher","unstructured":"Tal Ben-Nun and Torsten Hoefler. 2019. Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis. ACM Comput. Surv. 52 4 Article 65 (Aug. 2019) 43\u00a0pages. 10.1145\/3320060","DOI":"10.1145\/3320060"},{"key":"e_1_3_3_2_6_2","doi-asserted-by":"crossref","unstructured":"Tal Ben-Nun and Torsten Hoefler. 2019. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. ACM Comput. Surv. 52 4 (Aug. 2019) 65:1\u201365:43.","DOI":"10.1145\/3320060"},{"key":"e_1_3_3_2_7_2","unstructured":"Xiao Bi Deli Chen Guanting Chen Shanhuang Chen Damai Dai Chengqi Deng Honghui Ding Kai Dong Qiushi Du Zhe Fu et\u00a0al. 2024. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024)."},{"key":"e_1_3_3_2_8_2","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared\u00a0D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et\u00a0al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877\u20131901."},{"key":"e_1_3_3_2_9_2","doi-asserted-by":"crossref","unstructured":"Pawe\u0142 Budzianowski and Ivan Vuli\u0107. 2019. Hello it\u2019s GPT-2\u2013how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774 (2019).","DOI":"10.18653\/v1\/D19-5602"},{"key":"e_1_3_3_2_10_2","doi-asserted-by":"crossref","unstructured":"Jinfan Chen Shigang Li Ran Guo Jinhui Yuan and Torsten Hoefler. 2024. AutoDDL: Automatic Distributed Deep Learning with Asymptotically Optimal Communication. IEEE Transactions on Parallel and Distributed Systems (2024).","DOI":"10.1109\/TPDS.2024.3397800"},{"key":"e_1_3_3_2_11_2","unstructured":"Tianqi Chen Bing Xu Chiyuan Zhang and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016)."},{"key":"e_1_3_3_2_12_2","unstructured":"Aakanksha Chowdhery Sharan Narang Jacob Devlin Maarten Bosma Gaurav Mishra Adam Roberts Paul Barham Hyung\u00a0Won Chung Charles Sutton Sebastian Gehrmann Parker Schuh Kensen Shi Sasha Tsvyashchenko Joshua Maynez Abhishek Rao Parker Barnes Yi Tay Noam Shazeer Vinodkumar Prabhakaran Emily Reif Nan Du Ben Hutchinson Reiner Pope James Bradbury Jacob Austin Michael Isard Guy Gur-Ari Pengcheng Yin Toju Duke Anselm Levskaya Sanjay Ghemawat Sunipa Dev Henryk Michalewski Xavier Garcia Vedant Misra Kevin Robinson Liam Fedus Denny Zhou Daphne Ippolito David Luan Hyeontaek Lim Barret Zoph Alexander Spiridonov Ryan Sepassi David Dohan Shivani Agrawal Mark Omernick Andrew\u00a0M. Dai Thanumalayan\u00a0Sankaranarayana Pillai Marie Pellat Aitor Lewkowycz Erica Moreira Rewon Child Oleksandr Polozov Katherine Lee Zongwei Zhou Xuezhi Wang Brennan Saeta Mark Diaz Orhan Firat Michele Catasta Jason Wei Kathy Meier-Hellstern Douglas Eck Jeff Dean Slav Petrov and Noah Fiedel. 2023. PaLM: Scaling Language Modeling with Pathways. Journal of Machine Learning Research 24 240 (2023) 1\u2013113. 
http:\/\/jmlr.org\/papers\/v24\/22-1144.html"},{"key":"e_1_3_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3695053.3731410"},{"key":"e_1_3_3_2_14_2","doi-asserted-by":"crossref","unstructured":"James Demmel David Eliahu Armando Fox Shoaib Kamil Benjamin Lipshitz Oded Schwartz and Omer Spillinger. 2013. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication. International Symposium on Parallel and Distributed Processing (2013).","DOI":"10.1109\/IPDPS.2013.80"},{"key":"e_1_3_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356207"},{"key":"e_1_3_3_2_16_2","unstructured":"Shiqing Fan Yi Rong Chen Meng Zongyan Cao Siyu Wang Zhen Zheng Chuan Wu Guoping Long Jun Yang Lixue Xia Lansong Diao Xiaoyong Liu and Wei Lin. 2021. DAPPLE: a pipelined data parallel approach for training large models. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2021)."},{"key":"e_1_3_3_2_17_2","unstructured":"Jiarui Fang and Shangchun Zhao. 2024. A Unified Sequence Parallelism Approach for Long Context Generative AI. arXiv preprint arXiv:2405.07719 (2024)."},{"key":"e_1_3_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503221.3508418"},{"key":"e_1_3_3_2_19_2","unstructured":"Yanping Huang Youlong Cheng Ankur Bapna Orhan Firat Mia\u00a0Xu Chen Dehao Chen HyoukJoong Lee Jiquan Ngiam Quoc\u00a0V. Le Yonghui Wu and Zhifeng Chen. 2018. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. Advances in Neural Information Processing Systems (2018)."},{"key":"e_1_3_3_2_20_2","unstructured":"Huawei. 2025. torch-npu. Online. https:\/\/gitee.com\/ascend\/pytorch Accessed: 2025-04-14."},{"key":"e_1_3_3_2_21_2","unstructured":"Huawei Ascend. 2024. CANN HCCL: Huawei Collective Communication Library. Online. https:\/\/gitee.com\/ascend\/cann-hccl Accessed: 2025-03-20."},{"key":"e_1_3_3_2_22_2","unstructured":"Huawei Ascend. 2025. Compute Architecture for Neural Networks. Online. https:\/\/www.hiascend.com\/developer\/download\/community\/result?module=cann Accessed: 2025-04-14."},{"key":"e_1_3_3_2_23_2","unstructured":"Huawei Ascend. 2025. Huawei Cache Coherent System. Online. https:\/\/www.hiascend.com\/doc_center\/source\/zh\/Pytorch\/600\/ptmoddevg\/trainingmigrguide\/performance_tuning_0042.html Accessed: 2025-04-14."},{"key":"e_1_3_3_2_24_2","unstructured":"Huawei Cloud. 2025. Huawei Cloud launched the CloudMatrix 384 super node. Online. https:\/\/news.futunn.com\/en\/flash\/18684265\/huawei-cloud-launched-the-cloudmatrix-384-super-node-which-has?level=1&data_ticket=1744606503642941 Accessed: 2025-04-14."},{"key":"e_1_3_3_2_25_2","doi-asserted-by":"crossref","unstructured":"Dror Irony Sivan Toledo and Alexander Tiskin. 2004. Communication lower bounds for distributed-memory matrix multiplication. J. Parallel and Distrib. Comput. 64 9 (2004) 1017\u20131026.","DOI":"10.1016\/j.jpdc.2004.03.021"},{"key":"e_1_3_3_2_26_2","first-page":"711","volume-title":"Proceedings of Machine Learning and Systems","volume":"3","author":"Ivanov Andrei","year":"2021","unstructured":"Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. 2021. Data Movement Is All You Need: A Case Study on Optimizing Transformers. In Proceedings of Machine Learning and Systems, A.\u00a0Smola, A.\u00a0Dimakis, and I.\u00a0Stoica (Eds.), Vol.\u00a03. 
711\u2013732."},{"key":"e_1_3_3_2_27_2","unstructured":"Sam\u00a0Ade Jacobs Masahiro Tanaka Chengming Zhang Minjia Zhang Leon Song Samyam Rajbhandari and Yuxiong He. 2023. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2309.14509 (2023)."},{"key":"e_1_3_3_2_28_2","unstructured":"Zhihao Jia Matei Zaharia and Alex Aiken. 2018. Beyond Data and Model Parallelism for Deep Neural Networks. Proceedings of the 2nd Conference on Machine Learning and Systems (2018)."},{"key":"e_1_3_3_2_29_2","unstructured":"Albert\u00a0Q Jiang Alexandre Sablayrolles Antoine Roux Arthur Mensch Blanche Savary Chris Bamford Devendra\u00a0Singh Chaplot Diego de\u00a0las Casas Emma\u00a0Bou Hanna Florian Bressand et\u00a0al. 2024. Mixtral of experts. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2401.04088 (2024)."},{"key":"e_1_3_3_2_30_2","first-page":"745","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)","author":"Jiang Ziheng","year":"2024","unstructured":"Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, and Xin Liu. 2024. MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). USENIX Association, Santa Clara, CA, 745\u2013760. https:\/\/www.usenix.org\/conference\/nsdi24\/presentation\/jiang-ziheng"},{"key":"e_1_3_3_2_31_2","unstructured":"Jared Kaplan Sam McCandlish T.\u00a0J. Henighan Tom\u00a0B. Brown Benjamin Chess Rewon Child Scott Gray Alec Radford Jeff Wu and Dario Amodei. 2020. Scaling Laws for Neural Language Models. ArXiv abs\/2001.08361 (2020)."},{"key":"e_1_3_3_2_32_2","unstructured":"Vijay Korthikanti Sangkug\u00a0Lym Jared\u00a0Casper Lawrence McAfee Mohammad\u00a0Shoeybi Michael\u00a0Andersch and Bryan Catanzaro. 2023. Reducing activation recomputation in large transformer models[J]. Proceedings of Machine Learning and SystemS. Proceedings of Machine Learning and Systems (2023)."},{"key":"e_1_3_3_2_33_2","unstructured":"Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. ArXiv abs\/1404.5997 (2014)."},{"key":"e_1_3_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356181"},{"key":"e_1_3_3_2_35_2","unstructured":"Dmitry Lepikhin HyoukJoong Lee Yuanzhong Xu Dehao Chen Orhan Firat Yanping Huang Maxim Krikun Noam Shazeer and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2006.16668 (2020)."},{"key":"e_1_3_3_2_36_2","unstructured":"Dacheng Li Rulin Shao Anze Xie Eric\u00a0P Xing Xuezhe Ma Ion Stoica Joseph\u00a0E Gonzalez and Hao Zhang. 2023. Distflashattn: Distributed memory-efficient attention for long-context llms training. 
arXiv preprint arXiv:2310.03294 (2023)."},{"key":"e_1_3_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/2640087.2644155"},{"key":"e_1_3_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3332466.3374528"},{"key":"e_1_3_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476145"},{"key":"e_1_3_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3605573.3605613"},{"key":"e_1_3_3_2_41_2","unstructured":"Shenggui Li Fuzhao Xue Chaitanya Baranwal Yongbin Li and Yang You. 2021. Sequence parallelism: Long sequence training from system perspective. arXiv preprint arXiv:2105.13120 (2021)."},{"key":"e_1_3_3_2_42_2","unstructured":"Aixin Liu Bei Feng Bin Wang Bingxuan Wang Bo Liu Chenggang Zhao Chengqi Deng Chong Ruan Damai Dai Daya Guo et\u00a0al. 2024. DeepSeek-V2: A strong economical and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434 (2024)."},{"key":"e_1_3_3_2_43_2","unstructured":"Aixin Liu Bei Feng Bing Xue Bingxuan Wang Bochao Wu Chengda Lu Chenggang Zhao Chengqi Deng Chenyu Zhang Chong Ruan et\u00a0al. 2024. DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437 (2024)."},{"key":"e_1_3_3_2_44_2","unstructured":"Hao Liu Matei Zaharia and Pieter Abbeel. 2023. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889 (2023)."},{"key":"e_1_3_3_2_45_2","unstructured":"Yuliang Liu Shenggui Li Jiarui Fang Yanjun Shao Boyuan Yao and Yang You. 2023. Colossal-auto: Unified automation of parallelization and activation checkpoint for large-scale models. arXiv preprint arXiv:2302.02599 (2023)."},{"key":"e_1_3_3_2_46_2","doi-asserted-by":"crossref","unstructured":"Deepak Narayanan Aaron Harlap Amar Phanishayee Vivek Seshadri Nikhil Devanur Greg Granger Phil Gibbons and Matei Zaharia. 2019. PipeDream: Generalized Pipeline Parallelism for DNN Training. The 27th ACM Symposium on Operating Systems Principles (2019).","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_3_3_2_48_2","unstructured":"NVIDIA. 2025. NVIDIA Collective Communications Library. Online. https:\/\/github.com\/NVIDIA\/nccl Accessed: 2025-04-14."},{"key":"e_1_3_3_2_49_2","unstructured":"OpenAI. 2023. GPT-4 Technical Report. ArXiv abs\/2303.08774 (2023)."},{"key":"e_1_3_3_2_50_2","first-page":"18332","volume-title":"International conference on machine learning","author":"Rajbhandari Samyam","year":"2022","unstructured":"Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza\u00a0Yazdani Aminabadi, Ammar\u00a0Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In International conference on machine learning. PMLR, 18332\u201318346."},{"key":"e_1_3_3_2_51_2","volume-title":"arXiv e-prints arXiv:1910.02054","author":"Rajbhandari Samyam","year":"2019","unstructured":"Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2019. ZeRO: memory optimizations toward training trillion parameter models. In arXiv e-prints arXiv:1910.02054."},{"key":"e_1_3_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_3_3_2_53_2","unstructured":"Alexander Sergeev and Mike Del\u00a0Balso. 2018. 
Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018)."},{"key":"e_1_3_3_2_54_2","unstructured":"Noam Shazeer Azalia Mirhoseini Krzysztof Maziarz Andy Davis Quoc Le Geoffrey Hinton and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)."},{"key":"e_1_3_3_2_55_2","unstructured":"Mohammad Shoeybi Mostofa Patwary Raul Puri Patrick LeGresley Jared Casper and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)."},{"key":"e_1_3_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3577193.3593704"},{"key":"e_1_3_3_2_57_2","first-page":"90","volume-title":"European Conference on Parallel Processing","author":"Solomonik Edgar","year":"2011","unstructured":"Edgar Solomonik and James Demmel. 2011. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In European Conference on Parallel Processing. Springer, 90\u2013109."},{"key":"e_1_3_3_2_58_2","unstructured":"Gemini Team Rohan Anil Sebastian Borgeaud Yonghui Wu Jean-Baptiste Alayrac Jiahui Yu Radu Soricut Johan Schalkwyk Andrew\u00a0M Dai Anja Hauth et\u00a0al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)."},{"key":"e_1_3_3_2_59_2","unstructured":"Gemini Team Petko Georgiev Ving\u00a0Ian Lei Ryan Burnell Libin Bai Anmol Gulati Garrett Tanzer Damien Vincent Zhufeng Pan Shibo Wang et\u00a0al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)."},{"key":"e_1_3_3_2_60_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan\u00a0N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_3_2_61_2","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan\u00a0N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I.\u00a0Guyon, U.\u00a0Von Luxburg, S.\u00a0Bengio, H.\u00a0Wallach, R.\u00a0Fergus, S.\u00a0Vishwanathan, and R.\u00a0Garnett (Eds.), Vol.\u00a030. Curran Associates, Inc. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"},{"key":"e_1_3_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3302424.3303953"},{"key":"e_1_3_3_2_63_2","unstructured":"Jinhui Yuan Xinqi Li Cheng Cheng Juncheng Liu Ran Guo Shenghang Cai Chi Yao Fei Yang Xiaodong Yi Chuan Wu et\u00a0al. 2021. OneFlow: Redesign the distributed deep learning framework from scratch. arXiv preprint arXiv:2110.15032 (2021)."},{"key":"e_1_3_3_2_64_2","first-page":"559","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric\u00a0P. Xing, Joseph\u00a0E. Gonzalez, and Ion Stoica. 2022. 
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 559\u2013578. https:\/\/www.usenix.org\/conference\/osdi22\/presentation\/zheng-lianmin"}],"event":{"name":"SC '25: The International Conference for High Performance Computing, Networking, Storage and Analysis","location":"St. Louis MO USA","acronym":"SC '25","sponsor":["SIGHPC ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing"]},"container-title":["Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3712285.3759783","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T18:54:50Z","timestamp":1773255290000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3712285.3759783"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,15]]},"references-count":63,"alternative-id":["10.1145\/3712285.3759783","10.1145\/3712285"],"URL":"https:\/\/doi.org\/10.1145\/3712285.3759783","relation":{},"subject":[],"published":{"date-parts":[[2025,11,15]]},"assertion":[{"value":"2025-11-15","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}