{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T12:28:44Z","timestamp":1773318524815,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":62,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,16]]},"DOI":"10.1145\/3712285.3759775","type":"proceedings-article","created":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T16:04:47Z","timestamp":1762963487000},"page":"1351-1367","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Balanced and Elastic End-to-end Training of Dynamic LLMs"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7165-2095","authenticated-orcid":false,"given":"Mohamed","family":"Wahib","sequence":"first","affiliation":[{"name":"RIKEN Center for Computational Science (R-CCS), Tokyo, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2880-0857","authenticated-orcid":false,"given":"Muhammed Abdullah","family":"Soyturk","sequence":"additional","affiliation":[{"name":"Ko\u00e7 University, Turkey, Istanbul, Turkiye"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2351-0770","authenticated-orcid":false,"given":"Didem","family":"Unat","sequence":"additional","affiliation":[{"name":"Ko\u00e7 University, Turkey, Istanbul, Turkiye and NVIDIA Corporation, Istanbul, Turkiye"}]}],"member":"320","published-online":{"date-parts":[[2025,11,15]]},"reference":[{"key":"e_1_3_3_3_2_2","unstructured":"Guillaume Bellec David Kappel Wolfgang Maass and Robert Legenstein. 2017. Deep rewiring: Training very sparse deep networks. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1711.05136 (2017)."},{"key":"e_1_3_3_3_3_2","doi-asserted-by":"publisher","unstructured":"Dimitri\u00a0P. Bertsekas. 1992. Auction algorithms for network flow problems: A tutorial introduction. Computational Optimization and Applications 1 1 (01 Oct 1992) 7\u201366. 10.1007\/BF00247653","DOI":"10.1007\/BF00247653"},{"key":"e_1_3_3_3_4_2","unstructured":"Davis Blalock Jose\u00a0Javier Gonzalez\u00a0Ortiz Jonathan Frankle and John Guttag. 2020. What is the state of neural network pruning? Proceedings of machine learning and systems 2 (2020) 129\u2013146."},{"key":"e_1_3_3_3_5_2","doi-asserted-by":"publisher","unstructured":"Hongrong Cheng Miao Zhang and Javen\u00a0Qinfeng Shi. 2024. A Survey on Deep Neural Network Pruning: Taxonomy Comparison Analysis and Recommendations. IEEE Transactions on Pattern Analysis and Machine Intelligence 46 12 (2024) 10558\u201310578. 10.1109\/TPAMI.2024.3447085","DOI":"10.1109\/TPAMI.2024.3447085"},{"key":"e_1_3_3_3_6_2","series-title":"(NIPS \u201922)","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"Dao Tri","year":"2024","unstructured":"Tri Dao, Daniel\u00a0Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2024. FLASHATTENTION: fast and memory-efficient exact attention with IO-awareness. In Proceedings of the 36th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS \u201922). 
Curran Associates Inc., Red Hook, NY, USA, Article 1189, 16\u00a0pages."},{"key":"e_1_3_3_3_7_2","unstructured":"DeepSeek-AI Aixin Liu Bei Feng Bing Xue Bingxuan Wang Bochao Wu Chengda Lu Chenggang Zhao Chengqi Deng Chenyu Zhang Chong Ruan Damai Dai Daya Guo Dejian Yang Deli Chen Dongjie Ji Erhang Li Fangyun Lin Fucong Dai Fuli Luo Guangbo Hao Guanting Chen Guowei Li H. Zhang Han Bao Hanwei Xu Haocheng Wang Haowei Zhang Honghui Ding Huajian Xin Huazuo Gao Hui Li Hui Qu J.\u00a0L. Cai Jian Liang Jianzhong Guo Jiaqi Ni Jiashi Li Jiawei Wang Jin Chen Jingchang Chen Jingyang Yuan Junjie Qiu Junlong Li Junxiao Song Kai Dong Kai Hu Kaige Gao Kang Guan Kexin Huang Kuai Yu Lean Wang Lecong Zhang Lei Xu Leyi Xia Liang Zhao Litong Wang Liyue Zhang Meng Li Miaojun Wang Mingchuan Zhang Minghua Zhang Minghui Tang Mingming Li Ning Tian Panpan Huang Peiyi Wang Peng Zhang Qiancheng Wang Qihao Zhu Qinyu Chen Qiushi Du R.\u00a0J. Chen R.\u00a0L. Jin Ruiqi Ge Ruisong Zhang Ruizhe Pan Runji Wang Runxin Xu Ruoyu Zhang Ruyi Chen S.\u00a0S. Li Shanghao Lu Shangyan Zhou Shanhuang Chen Shaoqing Wu Shengfeng Ye Shengfeng Ye Shirong Ma Shiyu Wang Shuang Zhou Shuiping Yu Shunfeng Zhou Shuting Pan T. Wang Tao Yun Tian Pei Tianyu Sun W.\u00a0L. Xiao Wangding Zeng Wanjia Zhao Wei An Wen Liu Wenfeng Liang Wenjun Gao Wenqin Yu Wentao Zhang X.\u00a0Q. Li Xiangyue Jin Xianzu Wang Xiao Bi Xiaodong Liu Xiaohan Wang Xiaojin Shen Xiaokang Chen Xiaokang Zhang Xiaosha Chen Xiaotao Nie Xiaowen Sun Xiaoxiang Wang Xin Cheng Xin Liu Xin Xie Xingchao Liu Xingkai Yu Xinnan Song Xinxia Shan Xinyi Zhou Xinyu Yang Xinyuan Li Xuecheng Su Xuheng Lin Y.\u00a0K. Li Y.\u00a0Q. Wang Y.\u00a0X. Wei Y.\u00a0X. Zhu Yang Zhang Yanhong Xu Yanhong Xu Yanping Huang Yao Li Yao Zhao Yaofeng Sun Yaohui Li Yaohui Wang Yi Yu Yi Zheng Yichao Zhang Yifan Shi Yiliang Xiong Ying He Ying Tang Yishi Piao Yisong Wang Yixuan Tan Yiyang Ma Yiyuan Liu Yongqiang Guo Yu Wu Yuan Ou Yuchen Zhu Yuduan Wang Yue Gong Yuheng Zou Yujia He Yukun Zha Yunfan Xiong Yunxian Ma Yuting Yan Yuxiang Luo Yuxiang You Yuxuan Liu Yuyang Zhou Z.\u00a0F. Wu Z.\u00a0Z. Ren Zehui Ren Zhangli Sha Zhe Fu Zhean Xu Zhen Huang Zhen Zhang Zhenda Xie Zhengyan Zhang Zhewen Hao Zhibin Gou Zhicheng Ma Zhigang Yan Zhihong Shao Zhipeng Xu Zhiyu Wu Zhongyu Zhang Zhuoshu Li Zihui Gu Zijia Zhu Zijun Liu Zilin Li Ziwei Xie Ziyang Song Ziyi Gao and Zizheng Pan. 2025. DeepSeek-V3 Technical Report. arXiv:2412.19437\u00a0[cs.CL] https:\/\/arxiv.org\/abs\/2412.19437"},{"key":"e_1_3_3_3_8_2","unstructured":"Elastic. 2025. Elastic Cloud on Kubernetes (ECK). https:\/\/github.com\/elastic\/cloud-on-k8s [Retrieved 22 February 2025]."},{"key":"e_1_3_3_3_9_2","volume-title":"International Conference on Learning Representations","author":"Elbayad Maha","year":"2020","unstructured":"Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2020. Depth-Adaptive Transformer. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=SJg7KhVKPH"},{"key":"e_1_3_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3437801.3441593"},{"key":"e_1_3_3_3_11_2","unstructured":"William Fedus Barret Zoph and Noam Shazeer. 2022. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. J. Mach. Learn. Res. 23 (2022), 120:1\u2013120:39."},{"key":"e_1_3_3_3_12_2","unstructured":"Wikimedia Foundation. 2023. Wikimedia Downloads. https:\/\/dumps.wikimedia.org"},{"key":"e_1_3_3_3_13_2","unstructured":"Jonathan Frankle and Michael Carbin. 2018.
The lottery ticket hypothesis: Finding sparse trainable neural networks. arXiv preprint arXiv:1803.03635 (2018)."},{"key":"e_1_3_3_3_14_2","unstructured":"Trevor Gale Erich Elsen and Sara Hooker. 2019. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574 (2019)."},{"key":"e_1_3_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.5555\/3433701.3433723"},{"key":"e_1_3_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN.1993.713929"},{"key":"e_1_3_3_3_17_2","unstructured":"Yizeng Han Gao Huang Shiji Song Le Yang Honghui Wang and Yulin Wang. 2021. Dynamic neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021)."},{"key":"e_1_3_3_3_18_2","unstructured":"Aaron Harlap Deepak Narayanan Amar Phanishayee Vivek Seshadri Nikhil Devanur Greg Ganger and Phil Gibbons. 2018. PipeDream: Fast and efficient pipeline parallel DNN training. arXiv preprint arXiv:1806.03377 (2018)."},{"key":"e_1_3_3_3_19_2","unstructured":"Chaoyang He Shen Li Mahdi Soltanolkotabi and Salman Avestimehr. 2021. PipeTransformer: Automated elastic pipelining for distributed training of transformers. arXiv preprint arXiv:2102.03161 (2021)."},{"key":"e_1_3_3_3_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503221.3508418"},{"key":"e_1_3_3_3_21_2","unstructured":"Yanping Huang Youlong Cheng Ankur Bapna Orhan Firat Dehao Chen Mia Chen HyoukJoong Lee Jiquan Ngiam Quoc\u00a0V Le Yonghui Wu et\u00a0al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems 32 (2019)."},{"key":"e_1_3_3_3_22_2","unstructured":"Changho Hwang Wei Cui Yifan Xiong Ziyue Yang Ze Liu Han Hu Zilong Wang Rafael Salas Jithin Jose Prabhat Ram Joe Chau Peng Cheng Fan Yang Mao Yang and Yongqiang Xiong. 2022. Tutel: Adaptive Mixture-of-Experts at Scale. CoRR abs\/2206.03382 (June 2022). https:\/\/arxiv.org\/pdf\/2206.03382.pdf"},{"key":"e_1_3_3_3_23_2","doi-asserted-by":"crossref","unstructured":"Robert\u00a0A Jacobs Michael\u00a0I Jordan Steven\u00a0J Nowlan and Geoffrey\u00a0E Hinton. 1991. Adaptive mixtures of local experts. Neural Computation 3, 1 (1991), 79\u201387.","DOI":"10.1162\/neco.1991.3.1.79"},{"key":"e_1_3_3_3_24_2","unstructured":"Albert\u00a0Q. Jiang Alexandre Sablayrolles Antoine Roux Arthur Mensch Blanche Savary Chris Bamford Devendra\u00a0Singh Chaplot Diego de\u00a0las Casas Emma\u00a0Bou Hanna Florian Bressand Gianna Lengyel Guillaume Bour Guillaume Lample L\u00e9lio\u00a0Renard Lavaud Lucile Saulnier Marie-Anne Lachaux Pierre Stock Sandeep Subramanian Sophia Yang Szymon Antoniak Teven\u00a0Le Scao Th\u00e9ophile Gervet Thibaut Lavril Thomas Wang Timoth\u00e9e Lacroix and William\u00a0El Sayed. 2024. Mixtral of Experts. arXiv:2401.04088\u00a0[cs.LG]"},{"key":"e_1_3_3_3_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431379.3460644"},{"key":"e_1_3_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3534678.3539260"},{"key":"e_1_3_3_3_27_2","unstructured":"Dmitry Lepikhin HyoukJoong Lee Yuanzhong Xu Dehao Chen Orhan Firat Yanping Huang Maxim Krikun Noam Shazeer and Zhifeng Chen. 2020. GShard: Scaling giant models with conditional computation and automatic sharding.
arXiv preprint arXiv:2006.16668 (2020)."},{"key":"e_1_3_3_3_28_2","series-title":"Proceedings of Machine Learning Research","first-page":"6265","volume-title":"Proceedings of the 38th International Conference on Machine Learning","volume":"139","author":"Lewis Mike","year":"2021","unstructured":"Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. 2021. BASE Layers: Simplifying Training of Large, Sparse Models. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol.\u00a0139), Marina Meila and Tong Zhang (Eds.). PMLR, 6265\u20136274. https:\/\/proceedings.mlr.press\/v139\/lewis21a.html"},{"key":"e_1_3_3_3_29_2","first-page":"6265","volume-title":"International Conference on Machine Learning","author":"Lewis Mike","year":"2021","unstructured":"Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. 2021. BASE Layers: Simplifying training of large, sparse models. In International Conference on Machine Learning. PMLR, 6265\u20136274."},{"key":"e_1_3_3_3_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476145"},{"key":"e_1_3_3_3_31_2","doi-asserted-by":"publisher","unstructured":"Liu Liu Zheng Qu Zhaodong Chen Fengbin Tu Yufei Ding and Yuan Xie. 2022. Dynamic Sparse Attention for Scalable Transformer Acceleration. IEEE Trans. Comput. 71, 12 (2022), 3165\u20133178. 10.1109\/TC.2022.3208206","DOI":"10.1109\/TC.2022.3208206"},{"key":"e_1_3_3_3_32_2","unstructured":"Yuhan Liu Saurabh Agarwal and Shivaram Venkataraman. 2021. AutoFreeze: Automatically Freezing Model Blocks to Accelerate Fine-tuning. arXiv:2102.01386\u00a0[cs.LG]"},{"key":"e_1_3_3_3_33_2","volume-title":"ICLR","author":"Liu Zhuang","year":"2022","unstructured":"Zhuang Liu, Zhiqiu Xu, Hung-Ju Wang, Trevor Darrell, and Evan Shelhamer. 2022. Anytime Dense Prediction with Confidence Adaptivity. In ICLR. OpenReview.net."},{"key":"e_1_3_3_3_34_2","volume-title":"DeepSpeed User Guide: Pipeline Parallelism","year":"2024","unstructured":"Microsoft. 2024. DeepSpeed User Guide: Pipeline Parallelism. https:\/\/deepspeed.readthedocs.io\/en\/latest\/pipeline.html#deepspeed.pipe.PipelineModule"},{"key":"e_1_3_3_3_35_2","unstructured":"Microsoft. 2023. Microsoft\/DeepSpeed: A deep learning optimization library that makes distributed training and inference easy, efficient, and effective. https:\/\/github.com\/microsoft\/deepspeed"},{"key":"e_1_3_3_3_36_2","unstructured":"Mistral. 2024. Hugging Face: Mixtral of Experts. https:\/\/huggingface.co\/papers\/2401.04088"},{"key":"e_1_3_3_3_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_3_3_3_38_2","volume-title":"Nvidia NCCL User Guide","year":"2024","unstructured":"Nvidia. 2024. Nvidia NCCL User Guide. https:\/\/docs.nvidia.com\/deeplearning\/nccl\/user-guide\/docs\/usage\/communicators.html"},{"key":"e_1_3_3_3_39_2","volume-title":"Thirty-seventh Conference on Neural Information Processing Systems","author":"Pagliardini Matteo","year":"2023","unstructured":"Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, and Fran\u00e7ois Fleuret. 2023. Fast Attention Over Long Sequences With Dynamic Sparse Flash Attention. In Thirty-seventh Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=UINHuKeWUa"},{"key":"e_1_3_3_3_40_2","doi-asserted-by":"crossref","unstructured":"Sai Prasanna Anna Rogers and Anna Rumshisky. 2020. When BERT plays the lottery, all tickets are winning.
arXiv preprint arXiv:2005.00561 (2020).","DOI":"10.18653\/v1\/2020.emnlp-main.259"},{"key":"e_1_3_3_3_41_2","doi-asserted-by":"crossref","unstructured":"Fareed Qararyah Mohamed Wahib Do\u011fa Dikbay\u0131r Mehmet\u00a0Esat Belviranli and Didem Unat. 2021. A computational-graph partitioning method for training memory-constrained DNNs. Parallel Computing 104 (2021), 102792.","DOI":"10.1016\/j.parco.2021.102792"},{"key":"e_1_3_3_3_42_2","volume-title":"The Twelfth International Conference on Learning Representations","author":"Qi Penghui","year":"2024","unstructured":"Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2024. Zero Bubble (Almost) Pipeline Parallelism. In The Twelfth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=tuzTN0eIO5"},{"key":"e_1_3_3_3_43_2","unstructured":"Alec Radford Jeffrey Wu Rewon Child David Luan Dario Amodei and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. (2019). https:\/\/d4mucfpksywv.cloudfront.net\/better-language-models\/language-models.pdf"},{"key":"e_1_3_3_3_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_3_3_3_45_2","unstructured":"David Raposo Sam Ritter Blake Richards Timothy\u00a0P. Lillicrap Peter Humphreys and Adam Santoro. 2024. Mixture-of-Depths: Dynamically allocating compute in transformer-based language models. https:\/\/api.semanticscholar.org\/CorpusID:268876220"},{"key":"e_1_3_3_3_46_2","volume-title":"Advances in Neural Information Processing Systems","author":"Schuster Tal","year":"2022","unstructured":"Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh\u00a0Q. Tran, Yi Tay, and Donald Metzler. 2022. Confident Adaptive Language Modeling. In Advances in Neural Information Processing Systems, Alice\u00a0H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (Eds.). https:\/\/openreview.net\/forum?id=uLYc4L3C81A"},{"key":"e_1_3_3_3_47_2","doi-asserted-by":"crossref","unstructured":"Jaime Sevilla Lennart Heim Anson Ho Tamay Besiroglu Marius Hobbhahn and Pablo Villalobos. 2022. Compute trends across three eras of machine learning. arXiv preprint arXiv:2202.05924 (2022).","DOI":"10.1109\/IJCNN55064.2022.9891914"},{"key":"e_1_3_3_3_48_2","unstructured":"Noam Shazeer Azalia Mirhoseini Krzysztof Maziarz Andy Davis Quoc Le Geoffrey Hinton and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)."},{"key":"e_1_3_3_3_49_2","volume-title":"ICLR (Poster)","author":"Shazeer Noam","year":"2017","unstructured":"Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc\u00a0V. Le, Geoffrey\u00a0E. Hinton, and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In ICLR (Poster). OpenReview.net. http:\/\/dblp.uni-trier.de\/db\/conf\/iclr\/iclr2017.html#ShazeerMMDLHD17"},{"key":"e_1_3_3_3_50_2","unstructured":"Sheng Shen Alexei Baevski Ari\u00a0S Morcos Kurt Keutzer Michael Auli and Douwe Kiela. 2020. Reservoir transformers. arXiv preprint arXiv:2012.15045 (2020)."},{"key":"e_1_3_3_3_51_2","unstructured":"Mohammad Shoeybi Mostofa Patwary Raul Puri Patrick LeGresley Jared Casper and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism.
arXiv preprint arXiv:1909.08053 (2019)."},{"key":"e_1_3_3_3_52_2","doi-asserted-by":"crossref","unstructured":"Prasoon Sinha Akhil Guliani Rutwik Jain Brandon Tran Matthew\u00a0D Sinclair and Shivaram Venkataraman. 2022. Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale Accelerator-Rich Systems. arXiv preprint arXiv:2208.11035 (2022).","DOI":"10.1109\/SC41404.2022.00070"},{"key":"e_1_3_3_3_53_2","unstructured":"Shaden Smith. 2023. Pipeline parallelism. https:\/\/www.deepspeed.ai\/tutorials\/pipeline\/#load-balancing-pipeline-modules"},{"key":"e_1_3_3_3_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS49936.2021.00109"},{"key":"e_1_3_3_3_55_2","unstructured":"Yi Tay Dara Bahri Liu Yang Donald Metzler and Da-Cheng Juan. 2020. Sparse Sinkhorn Attention. arXiv:2002.11296\u00a0[cs.LG]"},{"key":"e_1_3_3_3_56_2","doi-asserted-by":"publisher","unstructured":"Yi Tay Mostafa Dehghani Dara Bahri and Donald Metzler. 2022. Efficient Transformers: A Survey. ACM Comput. Surv. 55, 6, Article 109 (Dec. 2022), 28\u00a0pages. 10.1145\/3530811","DOI":"10.1145\/3530811"},{"key":"e_1_3_3_3_57_2","unstructured":"LLaMA-MoE Team. 2024. LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training. https:\/\/github.com\/pjlab-sys4nlp\/llama-moe"},{"key":"e_1_3_3_3_58_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan\u00a0N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)."},{"key":"e_1_3_3_3_59_2","unstructured":"Lean Wang Huazuo Gao Chenggang Zhao Xu Sun and Damai Dai. 2024. Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts. arXiv:2408.15664\u00a0[cs.LG] https:\/\/arxiv.org\/abs\/2408.15664"},{"key":"e_1_3_3_3_60_2","unstructured":"Yiding Wang Decang Sun Kai Chen Fan Lai and Mosharaf Chowdhury. 2022. Efficient DNN Training With Knowledge-guided Layer Freezing. arXiv preprint arXiv:2201.06227 (2022)."},{"key":"e_1_3_3_3_61_2","unstructured":"Yanqi Zhou Tao Lei Hanxiao Liu Nan Du Yanping Huang Vincent Zhao Andrew Dai Zhifeng Chen Quoc Le and James Laudon. 2022. Mixture-of-Experts with Expert Choice Routing. arXiv preprint arXiv:2202.09368 (2022)."},{"key":"e_1_3_3_3_62_2","volume-title":"NeurIPS","author":"Zhou Yanqi","year":"2022","unstructured":"Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew\u00a0M. Dai, Zhifeng Chen, Quoc\u00a0V. Le, and James Laudon. 2022. Mixture-of-Experts with Expert Choice Routing. In NeurIPS."},{"key":"e_1_3_3_3_63_2","unstructured":"Michael Zhu and Suyog Gupta. 2017. To prune or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878 (2017)."}],"event":{"name":"SC '25: The International Conference for High Performance Computing, Networking, Storage and Analysis","location":"St.
Louis, MO, USA","acronym":"SC '25","sponsor":["SIGHPC: ACM Special Interest Group on High Performance Computing"]},"container-title":["Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3712285.3759775","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T18:44:13Z","timestamp":1773254653000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3712285.3759775"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,15]]},"references-count":62,"alternative-id":["10.1145\/3712285.3759775","10.1145\/3712285"],"URL":"https:\/\/doi.org\/10.1145\/3712285.3759775","relation":{},"subject":[],"published":{"date-parts":[[2025,11,15]]},"assertion":[{"value":"2025-11-15","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}