{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,9]],"date-time":"2026-07-09T15:23:34Z","timestamp":1783610614495,"version":"3.55.0"},"reference-count":242,"publisher":"Association for Computing Machinery (ACM)","issue":"10","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62441225 and U24A20234"],"award-info":[{"award-number":["62441225 and U24A20234"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Research Grants Council of the Hong Kong Special Administrative Region, China","award":["T45-401\/22-N"],"award-info":[{"award-number":["T45-401\/22-N"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2026,7,31]]},"abstract":"<jats:p>\n                    The emergence of large-scale Mixture of Experts (MoE) models represents a significant advancement in artificial intelligence, offering larger model capacity and computational efficiency through conditional computation. However, deploying and running inference on these models presents significant challenges in computational resources, latency, and energy efficiency. This comprehensive survey analyzes optimization techniques for MoE models across the entire system stack. We first establish a taxonomical framework that categorizes optimization approaches into model-level, system-level, and hardware-level optimizations. At the model level, we examine architectural innovations including efficient expert design, attention mechanisms, various compression techniques such as pruning, quantization, and knowledge distillation, as well as algorithm improvement including dynamic routing strategies and expert merging methods. 
At the system level, we investigate distributed computing approaches, load balancing mechanisms, and efficient scheduling algorithms that enable scalable deployment. Furthermore, we delve into hardware-specific optimizations and co-design strategies that maximize throughput and energy efficiency. This survey provides both a structured overview of existing solutions and identifies key challenges and promising research directions in MoE inference optimization. To facilitate ongoing updates and the sharing of cutting-edge advances in MoE inference optimization research, we have established a repository accessible at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"url\" xlink:href=\"https:\/\/github.com\/MoE-Inf\/awesome-moe-inference\/\">https:\/\/github.com\/MoE-Inf\/awesome-moe-inference\/<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.1145\/3794845","type":"journal-article","created":{"date-parts":[[2026,2,9]],"date-time":"2026-02-09T21:09:02Z","timestamp":1770671342000},"page":"1-37","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["A Survey on Inference Optimization Techniques for Mixture of Experts Models"],"prefix":"10.1145","volume":"58","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0378-2311","authenticated-orcid":false,"given":"Jiacheng","family":"Liu","sequence":"first","affiliation":[{"name":"Shanghai Jiao Tong University","place":["Shanghai, China"]},{"name":"The Chinese University of Hong Kong","place":["Shanghai, China"]},{"name":"Hong Kong University of Science and Technology","place":["Shanghai, China"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-8196-3953","authenticated-orcid":false,"given":"Peng","family":"Tang","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University","place":["Shanghai, 
China"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-8087-6135","authenticated-orcid":false,"given":"Wenfeng","family":"Wang","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University","place":["Shanghai, China"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-5083-4460","authenticated-orcid":false,"given":"Yuhang","family":"Ren","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University","place":["Shanghai, China"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4372-7851","authenticated-orcid":false,"given":"Xiaofeng","family":"Hou","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University","place":["Shanghai, China"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3055-5034","authenticated-orcid":false,"given":"Pheng Ann","family":"Heng","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong","place":["Hong Kong, Hong Kong"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0034-2302","authenticated-orcid":false,"given":"Minyi","family":"Guo","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Shanghai Jiao Tong University","place":["Shanghai, China"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6218-4659","authenticated-orcid":false,"given":"Chao","family":"Li","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Shanghai Jiao Tong University","place":["Shanghai, China"]}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2026,3,9]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Marah Abdin Jyoti Aneja Hany Awadalla Ahmed Awadallah Ammar Ahmad Awan 
Nguyen Bach Amit Bahree Arash Bakhtiari Jianmin Bao Harkirat Behl et\u00a0al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv:2404.14219. Retrieved from https:\/\/arxiv.org\/abs\/2404.14219 (2024)."},{"key":"e_1_3_1_3_2","unstructured":"Rishabh Agarwal Nino Vieillard Yongchao Zhou Piotr Stanczyk Sabela Ramos Garea Matthieu Geist and Olivier Bachem. 2024. On-policy distillation of language models: Learning from self-generated mistakes. The Twelfth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=3zKtaqxLhW"},{"key":"e_1_3_1_4_2","unstructured":"Sandhini Agarwal Lama Ahmad Jason Ai Sam Altman Andy Applebaum Edwin Arbus Rahul K. Arora Yu Bai Bowen Baker Haiming Bao et\u00a0al. 2025. gpt-oss-120b & gpt-oss-20b model card. arXiv:2508.10925. Retrieved from https:\/\/arxiv.org\/abs\/2508.10925 (2025)."},{"key":"e_1_3_1_5_2","unstructured":"Maryam Akhavan Aghdam Hongpeng Jin and Yanzhao Wu. 2024. DA-MoE: Towards dynamic expert allocation for mixture-of-experts models. arxiv:2409.06669. Retrieved from https:\/\/arxiv.org\/abs\/2409.06669"},{"key":"e_1_3_1_6_2","unstructured":"Meta AI. 2024. PyTorch. Retrieved from https:\/\/pytorch.org\/"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.52202\/068431-1620"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3644815.3644967"},{"key":"e_1_3_1_9_2","unstructured":"Anthropic. 2023. The Claude 3 Model Family: Opus Sonnet Haiku. Retrieved from https:\/\/www-cdn.anthropic.com\/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627\/Model_Card_Claude_3.pdf"},{"key":"e_1_3_1_10_2","doi-asserted-by":"crossref","unstructured":"Mikel Artetxe Shruti Bhosale Naman Goyal Todor Mihaylov Myle Ott Sam Shleifer Xi Victoria Lin Jingfei Du Srinivasan Iyer Ramakanth Pasunuru et\u00a0al. 2021. Efficient large scale language modeling with mixtures of experts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 
11699\u201311732.","DOI":"10.18653\/v1\/2022.emnlp-main.804"},{"key":"e_1_3_1_11_2","unstructured":"Guangji Bai Zheng Chai Chen Ling Shiyu Wang Jiaying Lu Nan Zhang Tingwei Shi Ziyang Yu Mengdan Zhu Yifei Zhang et\u00a0al. 2024. Beyond efficiency: A systematic survey of resource-efficient large language models. arXiv:2401.00625. Retrieved from https:\/\/arxiv.org\/abs\/2401.00625 (2024)."},{"key":"e_1_3_1_12_2","unstructured":"Baidu-ERNIE-Team. 2025. ERNIE 4.5 Technical Report. Retrieved October 20 2025 from https:\/\/ernie.baidu.com\/blog\/publication\/ERNIE_Technical_Report.pdf"},{"key":"e_1_3_1_13_2","doi-asserted-by":"crossref","unstructured":"Ruisi Cai Yeonju Ro Geon-Woo Kim Peihao Wang Babak Ehteshami Bejnordi Aditya Akella and Zhangyang Wang. 2024. Read-ME: Refactorizing LLMs as router-decoupled mixture of experts with system co-design. Advances in Neural Information Processing Systems 37 (2024) 116126\u2013116148.","DOI":"10.52202\/079017-3687"},{"key":"e_1_3_1_14_2","unstructured":"Weilin Cai Juyong Jiang Le Qin Junwei Cui Sunghun Kim and Jiayi Huang. 2025. Shortcut-connected expert parallelism for accelerating mixture-of-experts. In Proceedings of the 42nd International Conference on Machine Learning. PMLR 267 (2025) 6211\u20136228."},{"key":"e_1_3_1_15_2","unstructured":"Weilin Cai Juyong Jiang Fan Wang Jing Tang Sunghun Kim and Jiayi Huang. 2025. A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering 37 7 (2025) 3896\u20133915."},{"key":"e_1_3_1_16_2","doi-asserted-by":"crossref","unstructured":"Weilin Cai Le Qin and Jiayi Huang. 2024. MoC-System: Efficient fault tolerance for sparse mixture-of-experts model training. 
In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems 2 (2024) 655\u2013671.","DOI":"10.1145\/3676641.3716006"},{"key":"e_1_3_1_17_2","doi-asserted-by":"crossref","unstructured":"Shiyi Cao Shu Liu Tyler Griggs Peter Schafhalter Xiaoxuan Liu Ying Sheng Joseph E. Gonzalez Matei Zaharia and Ion Stoica. 2024. MoE-Lightning: High-throughput MoE inference on memory-constrained GPUs. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems 1 (2024) 715\u2013730.","DOI":"10.1145\/3669940.3707267"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.52202\/068431-1611"},{"key":"e_1_3_1_19_2","unstructured":"I-Chun Chen Hsu-Shen Liu Wei-Fang Sun Chen-Hao Chao Yen-Chang Hsu Chun-Yi Lee. 2025. Retraining-free merging of sparse mixture-of-experts via hierarchical clustering. Forty-second International Conference on Machine Learning. https:\/\/openreview.net\/forum?id=hslOzRxzXL"},{"key":"e_1_3_1_20_2","unstructured":"Tianyu Chen Shaohan Huang Yuan Xie Binxing Jiao Daxin Jiang Haoyi Zhou Jianxin Li and Furu Wei. 2022. Task-specific expert pruning for sparse mixture-of-experts. arXiv:2206.00277. Retrieved from https:\/\/arxiv.org\/abs\/2206.00277 (2022)."},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00276"},{"issue":"240","key":"e_1_3_1_22_2","first-page":"1","article-title":"Palm: Scaling language modeling with pathways","volume":"24","author":"Chowdhery Aakanksha","year":"2023","unstructured":"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et\u00a0al. 2023. Palm: Scaling language modeling with pathways. 
Journal of Machine Learning Research 24, 240 (2023), 1\u2013113.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_1_23_2","unstructured":"Mohammed Nowaz Rabbani Chowdhury Meng Wang Kaoutar El Maghraoui Naigang Wang Pin-Yu Chen and Christopher Carothers. 2024. A provably effective method for pruning experts in fine-tuned sparse mixture-of-experts. International Conference on Machine Learning. 8815\u20138847."},{"key":"e_1_3_1_24_2","first-page":"6074","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Chowdhury Mohammed Nowaz Rabbani","year":"2023","unstructured":"Mohammed Nowaz Rabbani Chowdhury, Shuai Zhang, Meng Wang, Sijia Liu, and Pin-Yu Chen. 2023. Patch-level routing in mixture-of-experts is provably sample-efficient for convolutional neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, 6074\u20136114."},{"key":"e_1_3_1_25_2","unstructured":"Peizhuang Cong Aomufei Yuan Shimao Chen Yuxuan Tian Bowen Ye and Tong Yang. 2024. Prediction is all MoE needs: Expert load distribution goes from fluctuating to stabilizing. arXiv:2404.16914. Retrieved from https:\/\/arxiv.org\/abs\/2404.16914 (2024)."},{"key":"e_1_3_1_26_2","unstructured":"Marta R. Costa-juss\u00e0 James Cross Onur \u00c7elebi Maha Elbayad Kenneth Heafield Kevin Heffernan Elahe Kalbassi Janice Lam Daniel Licht Jean Maillard et\u00a0al. 2022. No language left behind: Scaling human-centered machine translation. arXiv:2207.04672. Retrieved from https:\/\/arxiv.org\/abs\/2207.04672 (2022)."},{"key":"e_1_3_1_27_2","doi-asserted-by":"crossref","unstructured":"Lu\u00eds Cruz Xavier Franch Gutierrez and Silverio Mart\u00ednez-Fern\u00e1ndez. 2025. Innovating for tomorrow: The convergence of SE and green AI. 
ACM Transactions on Software Engineering and Methodology 34 5 (2025) 1\u20133.","DOI":"10.1145\/3712007"},{"key":"e_1_3_1_28_2","doi-asserted-by":"crossref","unstructured":"R\u00f3bert Csord\u00e1s Kazuki Irie J\u00fcrgen Schmidhuber Christopher Potts and Christopher D. Manning. 2024. MoEUT: Mixture-of-experts universal transformers. Advances in Neural Information Processing Systems 37 (2024) 28589\u201328614.","DOI":"10.52202\/079017-0897"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.52202\/079017-2368"},{"key":"e_1_3_1_30_2","first-page":"797","volume-title":"Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)","author":"Cui Weihao","year":"2023","unstructured":"Weihao Cui, Zhenhua Han, Lingji Ouyang, Yichuan Wang, Ningxin Zheng, Lingxiao Ma, Yuqing Yang, Fan Yang, Jilong Xue, Lili Qiu, et\u00a0al. 2023. Optimizing dynamic neural networks with brainstorm. In Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, Boston, MA, 797\u2013815. Retrieved from https:\/\/www.usenix.org\/conference\/osdi23\/presentation\/cui"},{"key":"e_1_3_1_31_2","doi-asserted-by":"crossref","unstructured":"Damai Dai Chengqi Deng Chenggang Zhao R. X. Xu Huazuo Gao Deli Chen Jiashi Li Wangding Zeng Xingkai Yu Y. Wu et\u00a0al. 2024. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics 1 (2024) 1280\u20131297.","DOI":"10.18653\/v1\/2024.acl-long.70"},{"key":"e_1_3_1_32_2","unstructured":"DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arxiv:2501.12948. Retrieved from https:\/\/arxiv.org\/abs\/2501.12948"},{"key":"e_1_3_1_33_2","unstructured":"DeepSeek-AI Aixin Liu Bei Feng Bin Wang Bingxuan Wang Bo Liu Chenggang Zhao Chengqi Deng Chong Ruan Damai Dai et\u00a0al. 2024. 
DeepSeek-V2: A strong economical and efficient mixture-of-experts language model. arxiv:2405.04434. Retrieved from https:\/\/arxiv.org\/abs\/2405.04434"},{"key":"e_1_3_1_34_2","unstructured":"DeepSeek-AI Aixin Liu Bei Feng Bing Xue Bingxuan Wang Bochao Wu Chengda Lu Chenggang Zhao Chengqi Deng Chenyu Zhang et\u00a0al. 2024. DeepSeek-V3 technical report. arXiv:2412.19437. Retrieved from https:\/\/arxiv.org\/abs\/2412.19437 (2024)."},{"key":"e_1_3_1_35_2","series-title":"Proceedings of Machine Learning Research","first-page":"7480","volume-title":"Proceedings of the 40th International Conference on Machine Learning","volume":"202","author":"Dehghani Mostafa","year":"2023","unstructured":"Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et\u00a0al. 2023. Scaling vision transformers to 22 billion parameters. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 7480\u20137512. Retrieved from https:\/\/proceedings.mlr.press\/v202\/dehghani23a.html"},{"key":"e_1_3_1_36_2","doi-asserted-by":"crossref","unstructured":"Yifeng Ding Jiawei Liu Yuxiang Wei Terry Yue Zhuo and Lingming Zhang. 2024. XFT: Unlocking the power of code instruction tuning by simply merging upcycled mixture-of-experts. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics 1 (2024) 12941\u201312955.","DOI":"10.18653\/v1\/2024.acl-long.699"},{"key":"e_1_3_1_37_2","volume-title":"Proceedings of the 3rd international workshop on paraphrasing (IWP2005)","author":"Dolan Bill","year":"2005","unstructured":"Bill Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. 
In Proceedings of the 3rd international workshop on paraphrasing (IWP2005)."},{"key":"e_1_3_1_38_2","unstructured":"Alexey Dosovitskiy. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929 (2020)."},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.106"},{"key":"e_1_3_1_40_2","first-page":"5547","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Du Nan","year":"2022","unstructured":"Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et\u00a0al. 2022. Glam: Efficient scaling of language models with mixture-of-experts. In Proceedings of the International Conference on Machine Learning. PMLR, 5547\u20135569."},{"key":"e_1_3_1_41_2","first-page":"224","article-title":"SiDA: Sparsity-inspired data-aware serving for efficient and scalable large mixture-of-experts models","volume":"6","author":"Du Zhixu","year":"2024","unstructured":"Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai Li, and Yiran Chen. 2024. SiDA: Sparsity-inspired data-aware serving for efficient and scalable large mixture-of-experts models. Proceedings of Machine Learning and Systems 6 (2024), 224\u2013238.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_42_2","unstructured":"David Eigen Marc\u2019Aurelio Ranzato and Ilya Sutskever. 2013. Learning factored representations in a deep mixture of experts. arXiv:1312.4314. Retrieved from https:\/\/arxiv.org\/abs\/1312.4314 (2013). Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:11492613"},{"key":"e_1_3_1_43_2","unstructured":"Artyom Eliseev and Denis Mazur. 2023. Fast inference of mixture-of-experts language models with offloading. arXiv:2312.17238. 
Retrieved from https:\/\/arxiv.org\/abs\/2312.17238 (2023)."},{"key":"e_1_3_1_44_2","unstructured":"Ahmad Faiz Sotaro Kaneda Ruhan Wang Rita Osi Prateek Sharma Fan Chen and Lei Jiang. 2023. Llmcarbon: Modeling the end-to-end carbon footprint of large language models. The Twelfth International Conference on Learning Representations. arXiv:2309.14393. Retrieved from https:\/\/arxiv.org\/abs\/2309.14393 (2023)."},{"key":"e_1_3_1_45_2","first-page":"28441","article-title":"M3ViT: Mixture-of-experts vision transformer for efficient multi-task learning with model-accelerator co-design","volume":"35","author":"Fan Zhiwen","year":"2022","unstructured":"Zhiwen Fan, Rishov Sarkar, Ziyu Jiang, Tianlong Chen, Kai Zou, Yu Cheng, Cong Hao, Zhangyang Wang, Hanxue Liang. 2022. M3ViT: Mixture-of-experts vision transformer for efficient multi-task learning with model-accelerator co-design. Advances in Neural Information Processing Systems 35 (2022), 28441\u201328457.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_46_2","unstructured":"William Fedus Jeff Dean and Barret Zoph. 2022. A review of sparse expert models in deep learning. arXiv:2209.01667. Retrieved from https:\/\/arxiv.org\/abs\/2209.01667 (2022)."},{"issue":"120","key":"e_1_3_1_47_2","first-page":"1","article-title":"Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity","volume":"23","author":"Fedus William","year":"2022","unstructured":"William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1\u201339.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_1_48_2","unstructured":"Elias Frantar and Dan Alistarh. 2023. Qmoe: Practical sub-1-bit compression of trillion-parameter models. In Proceedings of the 5th MLSys Conference. arXiv:2310.16795. 
Retrieved from https:\/\/arxiv.org\/abs\/2310.16795 (2023)."},{"key":"e_1_3_1_49_2","unstructured":"Elias Frantar Saleh Ashkboos Torsten Hoefler and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. The Eleventh International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=tcbBPnfwxS"},{"key":"e_1_3_1_50_2","unstructured":"Yao Fu Yinsicheng Jiang Yeqi Huang Ping Nie Zhan Lu Leyang Xue Congjie He Man-Kit Sit Jilong Xue Li Dong et\u00a0al. 2024. MoE-CAP: Cost-accuracy-performance benchmarking for mixture-of-experts systems. arXiv:2412.07067. Retrieved from https:\/\/arxiv.org\/abs\/2412.07067 (2024)."},{"key":"e_1_3_1_51_2","doi-asserted-by":"crossref","unstructured":"Zhenxiao Fu Fan Chen Shan Zhou Haitong Li and Lei Jiang. 2025. LLMCO2: Advancing accurate carbon footprint prediction for LLM inferences. ACM SIGENERGY Energy Informatics Review 5 2 (2025) 63\u201368.","DOI":"10.1145\/3757892.3757901"},{"key":"e_1_3_1_52_2","unstructured":"Chongyang Gao Kezhen Chen Jinmeng Rao Baochen Sun Ruibo Liu Daiyi Peng Yawen Zhang Xiaoyuan Guo Jie Yang and VS Subrahmanian. 2024. Higher layers need more lora experts. arXiv:2402.08562. Retrieved from https:\/\/arxiv.org\/abs\/2402.08562 (2024)."},{"key":"e_1_3_1_53_2","unstructured":"Ze-Feng Gao Peiyu Liu Wayne Xin Zhao Zhong-Yi Lu and Ji-Rong Wen. 2022. Parameter-efficient mixture-of-experts architecture for pre-trained language models. In Proceedings of the 29th International Conference on Computational Linguistics. 3263\u20133273."},{"key":"e_1_3_1_54_2","unstructured":"Georgi Gerganov. 2023. Llama.cpp. Retrieved from https:\/\/github.com\/ggerganov\/llama.cpp"},{"key":"e_1_3_1_55_2","unstructured":"Yuxian Gu Li Dong Furu Wei and Minlie Huang. 2023. Knowledge distillation of large language models. The Twelfth International Conference on Learning Representations. 
https:\/\/openreview.net\/forum?id=5h0qf7IBZZ"},{"key":"e_1_3_1_56_2","unstructured":"Yongxin Guo Zhenglin Cheng Xiaoying Tang Zhaopeng Tu and Tao Lin. 2024. Dynamic mixture of experts: An auto-tuning approach for efficient transformer models. The Thirteenth International Conference on Learning Representations. arXiv:2405.14297. Retrieved from https:\/\/arxiv.org\/abs\/2405.14297 (2024)."},{"key":"e_1_3_1_57_2","unstructured":"Vima Gupta Kartik Sinha Ada Gavrilovska and Anand Padmanabha Iyer. 2024. Lynx: Enabling efficient MoE inference through dynamic batch-aware expert selection. arXiv:2411.08982. Retrieved from https:\/\/arxiv.org\/abs\/2411.08982 (2024)."},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/34.142911"},{"key":"e_1_3_1_59_2","unstructured":"Jiaao He Jiezhong Qiu Aohan Zeng Zhilin Yang Jidong Zhai and Jie Tang. 2021. Fastmoe: A fast mixture-of-expert training system. arXiv:2103.13262. Retrieved from https:\/\/arxiv.org\/abs\/2103.13262 (2021)."},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503221.3508418"},{"key":"e_1_3_1_61_2","unstructured":"Shwai He Daize Dong Liang Ding and Ang Li. 2024. Demystifying the compression of mixture-of-experts through a unified framework. arXiv:2406.02500. Retrieved from https:\/\/arxiv.org\/abs\/2406.02500 (2024)."},{"key":"e_1_3_1_62_2","doi-asserted-by":"crossref","unstructured":"Shwai He Run-Ze Fan Liang Ding Li Shen Tianyi Zhou and Dacheng Tao. 2023. Merging experts into one: Improving computational efficiency of mixture of experts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 14685\u201314691.","DOI":"10.18653\/v1\/2023.emnlp-main.907"},{"key":"e_1_3_1_63_2","unstructured":"Xin He Shunkang Zhang Yuxin Wang Haiyan Yin Zihao Zeng Shaohuai Shi Zhenheng Tang Xiaowen Chu Ivor Tsang and Ong Yew Soon. 2024. ExpertFlow: Optimized expert activation and token allocation for efficient mixture-of-experts inference. arXiv:2410.17954. 
Retrieved from https:\/\/arxiv.org\/abs\/2410.17954 (2024)."},{"key":"e_1_3_1_64_2","unstructured":"Dan Hendrycks Collin Burns Steven Basart Andy Zou Mantas Mazeika Dawn Song and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=d7KBjmI3GmQ"},{"key":"e_1_3_1_65_2","article-title":"Improving efficiency in multi-modal autonomous embedded systems through adaptive gating","author":"Hou Xiaofeng","year":"2024","unstructured":"Xiaofeng Hou, Cheng Xu, Chao Li, Jiacheng Liu, Xuehan Tang, Kwang-Ting Cheng, and Minyi Guo. 2024. Improving efficiency in multi-modal autonomous embedded systems through adaptive gating. IEEE Transactions on Computers 74, 2 (2024), 691\u2013704.","journal-title":"IEEE Transactions on Computers"},{"key":"e_1_3_1_66_2","unstructured":"Haiyang Huang Newsha Ardalani Anna Sun Liu Ke Hsien-Hsin S. Lee Anjali Sridhar Shruti Bhosale Carole-Jean Wu and Benjamin Lee. 2023. Towards MoE deployment: Mitigating inefficiencies in mixture-of-expert (MoE) inference. arXiv:2303.06182. Retrieved from https:\/\/arxiv.org\/abs\/2303.06182 (2023)."},{"key":"e_1_3_1_67_2","unstructured":"Wei Huang Yue Liao Jianhui Liu Ruifei He Haoru Tan Shiming Zhang Hongsheng Li Si Liu and Xiaojuan Qi. 2024. MC-MoE: Mixture compressor for mixture-of-experts LLMs gains more. The Thirteenth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=hheFYjOsWO"},{"key":"e_1_3_1_68_2","unstructured":"Yongqi Huang Peng Ye Xiaoshui Huang Sheng Li Tao Chen Tong He and Wanli Ouyang. 2023. Experts weights averaging: A new general training scheme for vision transformers. arXiv:2308.06093. Retrieved from https:\/\/arxiv.org\/abs\/2308.06093 (2023)."},{"key":"e_1_3_1_69_2","unstructured":"Huggingface. 2023. Transformers: State-of-the-art Machine Learning for JAX PyTorch and TensorFlow. 
Retrieved from https:\/\/github.com\/huggingface\/transformers"},{"key":"e_1_3_1_70_2","first-page":"269","article-title":"Tutel: Adaptive mixture-of-experts at scale","volume":"5","author":"Hwang Changho","year":"2023","unstructured":"Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et\u00a0al. 2023. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems 5 (2023), 269\u2013287.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_71_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA59077.2024.00078"},{"key":"e_1_3_1_72_2","doi-asserted-by":"publisher","DOI":"10.1145\/3580305.3599278"},{"key":"e_1_3_1_73_2","doi-asserted-by":"crossref","unstructured":"HamidReza Imani Abdolah Amirany and Tarek El-Ghazawi. 2024. Mixture of experts with mixture of precisions for tuning quality of service. IEEE International Conference on Rebooting Computing (ICRC\u201924). 1\u20136.","DOI":"10.1109\/ICRC64395.2024.10937027"},{"key":"e_1_3_1_74_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1991.3.1.79"},{"key":"e_1_3_1_75_2","first-page":"1","article-title":"Beyond data and model parallelism for deep neural networks.","volume":"1","author":"Jia Zhihao","year":"2019","unstructured":"Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond data and model parallelism for deep neural networks. Proceedings of Machine Learning and Systems 1 (2019), 1\u201313.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_76_2","unstructured":"Albert Q. Jiang Alexandre Sablayrolles Antoine Roux Arthur Mensch Blanche Savary Chris Bamford Devendra Singh Chaplot Diego de las Casas Emma Bou Hanna Florian Bressand et\u00a0al. 2024. Mixtral of experts. arXiv:2401.04088. Retrieved from https:\/\/arxiv.org\/abs\/2401.04088 (2024)."},{"key":"e_1_3_1_77_2","unstructured":"Peng Jin Bo Zhu Li Yuan and Shuicheng Yan. 2024. 
Moe++: Accelerating mixture-of-experts methods with zero-computation experts. International Conference on Learning Representations 2025 (2024) 50832\u201350856. https:\/\/proceedings.iclr.cc\/paper_files\/paper\/2025\/file\/7efe88bb4138d602e56637cfcf713654-Paper-Conference.pdf"},{"key":"e_1_3_1_78_2","unstructured":"Peng Jin Bo Zhu Li Yuan and Shuicheng Yan. 2025. MoH: Multi-head attention as mixture-of-head attention. Forty-second International Conference on Machine Learning. https:\/\/openreview.net\/forum?id=eYtgs9k75o"},{"key":"e_1_3_1_79_2","unstructured":"Keisuke Kamahori Yile Gu Kan Zhu and Baris Kasikci. 2024. Fiddler: CPU-GPU orchestration for fast inference of mixture-of-experts models. The Thirteenth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=N5fVv6PZGz"},{"key":"e_1_3_1_80_2","unstructured":"Jared Kaplan Sam McCandlish Tom Henighan Tom B. Brown Benjamin Chess Rewon Child Scott Gray Alec Radford Jeffrey Wu and Dario Amodei. 2020. Scaling laws for neural language models. arXiv:2001.08361. Retrieved from https:\/\/arxiv.org\/abs\/2001.08361 (2020)."},{"key":"e_1_3_1_81_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1023"},{"key":"e_1_3_1_82_2","unstructured":"Sungyoon Kim Youngjun Kim Kihyo Moon and Minsung Jang. 2024. Ladimo: Layer-wise distillation inspired moefier. arXiv:2408.04278. Retrieved from https:\/\/arxiv.org\/abs\/2408.04278 (2024)."},{"key":"e_1_3_1_83_2","doi-asserted-by":"publisher","DOI":"10.1145\/3649329.3655951"},{"key":"e_1_3_1_84_2","unstructured":"Young Jin Kim Raffy Fahim and Hany Hassan Awadalla. 2023. Mixture of quantized experts (MoQE): Complementary effect of low-bit quantization and robustness. arXiv:2310.02410. Retrieved from https:\/\/arxiv.org\/abs\/2310.02410 (2023)."},{"key":"e_1_3_1_85_2","doi-asserted-by":"crossref","unstructured":"Young Jin Kim Rawn Henry Raffy Fahim and Hany Hassan Awadalla. 2022. 
Who says elephants can\u2019t run: Bringing large scale MoE models into cloud scale production. In Proceedings of the Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP). 36\u201343.","DOI":"10.18653\/v1\/2022.sustainlp-1.6"},{"key":"e_1_3_1_86_2","unstructured":"Kimi Team. 2025. Kimi K2: Open Agentic Intelligence. Retrieved October 20 2025 from https:\/\/github.com\/MoonshotAI\/Kimi-K2\/blob\/main\/tech_report.pdf"},{"key":"e_1_3_1_87_2","doi-asserted-by":"crossref","unstructured":"Rui Kong Yuanchun Li Qingtian Feng Weijun Wang Linghe Kong and Yunxin Liu. 2023. Serving MoE models on resource-constrained edge devices via dynamic expert swapping. IEEE Transactions on Computers 74 8 (2025) 2799\u20132811.","DOI":"10.1109\/TC.2025.3575905"},{"key":"e_1_3_1_88_2","unstructured":"Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Master\u2019s thesis University of Toronto (2009)."},{"key":"e_1_3_1_89_2","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_1_90_2","doi-asserted-by":"crossref","unstructured":"Jaeseong Lee Aurick Qiao Daniel F. Campos Zhewei Yao Yuxiong He and Seung-won Hwang. 2025. STUN: Structured-then-unstructured pruning for scalable MoE pruning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics 1 (2025) 13660\u201313676.","DOI":"10.18653\/v1\/2025.acl-long.671"},{"key":"e_1_3_1_91_2","unstructured":"Dmitry Lepikhin HyoukJoong Lee Yuanzhong Xu Dehao Chen Orhan Firat Yanping Huang Maxim Krikun Noam Shazeer and Zhifeng Chen. 2021. Gshard: Scaling giant models with conditional computation and automatic sharding. International Conference on Learning Representations. 
https:\/\/openreview.net\/forum?id=qrwe7XHTmYb"},{"key":"e_1_3_1_92_2","first-page":"6265","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Lewis Mike","year":"2021","unstructured":"Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. 2021. Base layers: Simplifying training of large, sparse models. In Proceedings of the International Conference on Machine Learning. PMLR, 6265\u20136274."},{"key":"e_1_3_1_93_2","doi-asserted-by":"crossref","unstructured":"Baolin Li Yankai Jiang Vijay Gadepally and Devesh Tiwari. 2024. Llm inference serving: Survey of recent advances and opportunities. IEEE High Performance Extreme Computing Conference. 1\u20138.","DOI":"10.1109\/HPEC62836.2024.10938426"},{"key":"e_1_3_1_94_2","first-page":"945","volume-title":"Proceedings of the 2023 USENIX Annual Technical Conference (USENIX ATC 23)","author":"Li Jiamin","year":"2023","unstructured":"Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. 2023. Accelerating distributed MoE training and inference with lina. In Proceedings of the 2023 USENIX Annual Technical Conference (USENIX ATC 23). USENIX Association, Boston, MA, 945\u2013959. Retrieved from https:\/\/www.usenix.org\/conference\/atc23\/presentation\/li-jiamin"},{"key":"e_1_3_1_95_2","first-page":"9694","article-title":"Align before fuse: Vision and language representation learning with momentum distillation","volume":"34","author":"Li Junnan","year":"2021","unstructured":"Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems 34 (2021), 9694\u20139705.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_96_2","doi-asserted-by":"crossref","unstructured":"Jiamin Li Qiang Su Yitao Yang Yimin Jiang Cong Wang and Hong Xu. 2023. 
Adaptive gating in mixture-of-experts based language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 3577\u20133587.","DOI":"10.18653\/v1\/2023.emnlp-main.217"},{"key":"e_1_3_1_97_2","unstructured":"Jing Li Zhijie Sun Xuan He Li Zeng Yi Lin Entong Li Binfan Zheng Rongqian Zhao and Xin Chen. 2024. Locmoe: A low-overhead moe for large language model training. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. 6377\u20136387."},{"key":"e_1_3_1_98_2","doi-asserted-by":"crossref","unstructured":"Jialong Li Shreyansh Tripathi Lakshay Rastogi Yiming Lei Rui Pan and Yiting Xia. 2025. Optimizing mixture-of-experts inference time combining model deployment and communication scheduling. IEEE Transactions on Networking 34 (2025) 2478\u20132497.","DOI":"10.1109\/TON.2025.3645806"},{"key":"e_1_3_1_99_2","unstructured":"Jinhao Li Jiaming Xu Shan Huang Yonghua Chen Wen Li Jun Liu Yaoxiu Lian Jiayi Pan Li Ding Hao Zhou et\u00a0al. 2024. Large language model inference acceleration: A comprehensive hardware perspective. arXiv:2410.04466. Retrieved from https:\/\/arxiv.org\/abs\/2410.04466 (2024)."},{"key":"e_1_3_1_100_2","unstructured":"Margaret Li Suchin Gururangan Tim Dettmers Mike Lewis Tim Althoff Noah A. Smith and Luke Zettlemoyer. 2022. Branch-train-merge: Embarrassingly parallel training of expert language models. arXiv:2208.03306. Retrieved from https:\/\/arxiv.org\/abs\/2208.03306 (2022)."},{"key":"e_1_3_1_101_2","unstructured":"Pingzhi Li Xiaolong Jin Yu Cheng and Tianlong Chen. 2024. Examining post-training quantization for mixture-of-experts: A benchmark. arXiv:2406.08155. Retrieved from https:\/\/arxiv.org\/abs\/2406.08155 (2024)."},{"key":"e_1_3_1_102_2","unstructured":"Pingzhi Li Zhenyu Zhang Prateek Yadav Yi-Lin Sung Yu Cheng Mohit Bansal and Tianlong Chen. 2024. Merge then compress: Demystify efficient SMoe with hints from its routing policy. 
The Twelfth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=eFWG9Cy3WK"},{"key":"e_1_3_1_103_2","first-page":"20336","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Li Yixiao","year":"2023","unstructured":"Yixiao Li, Yifan Yu, Qingru Zhang, Chen Liang, Pengcheng He, Weizhu Chen, and Tuo Zhao. 2023. Losparse: Structured compression of large language models based on low-rank and sparse approximation. In Proceedings of the International Conference on Machine Learning. PMLR, 20336\u201320350."},{"key":"e_1_3_1_104_2","unstructured":"Zhiding Liang Jinglei Cheng Rui Yang Hang Ren Zhixin Song Di Wu Xuehai Qian Tongyang Li and Yiyu Shi. 2023. Unleashing the potential of llms for quantum computing: A study in quantum architecture design. arXiv:2307.08191. Retrieved from https:\/\/arxiv.org\/abs\/2307.08191 (2023)."},{"key":"e_1_3_1_105_2","unstructured":"Opher Lieber Barak Lenz Hofit Bata Gal Cohen Jhonathan Osin Itay Dalmedigos Erez Safahi Shaked Meirom Yonatan Belinkov Shai Shalev-Shwartz et\u00a0al. 2024. Jamba: A hybrid transformer-mamba language model. arXiv:2403.19887. Retrieved from https:\/\/arxiv.org\/abs\/2403.19887 (2024)."},{"key":"e_1_3_1_106_2","doi-asserted-by":"publisher","DOI":"10.1145\/3649329.3656507"},{"key":"e_1_3_1_107_2","unstructured":"Enshu Liu Junyi Zhu Zinan Lin Xuefei Ning Matthew B. Blaschko Shengen Yan Guohao Dai Huazhong Yang and Yu Wang. 2024. Efficient expert pruning for sparse mixture-of-experts language models: Enhancing performance and reducing inference costs. arXiv:2407.00945. Retrieved from https:\/\/arxiv.org\/abs\/2407.00945 (2024)."},{"key":"e_1_3_1_108_2","doi-asserted-by":"publisher","DOI":"10.1145\/3603269.3604869"},{"key":"e_1_3_1_109_2","unstructured":"Liyuan Liu Young Jin Kim Shuohang Wang Chen Liang Yelong Shen Hao Cheng Xiaodong Liu Masahiro Tanaka Xiaoxia Wu Wenxiang Hu et\u00a0al. 2024. GRIN: GRadient-INformed MoE. arXiv:2409.12136. 
Retrieved from https:\/\/arxiv.org\/abs\/2409.12136 (2024)."},{"key":"e_1_3_1_110_2","doi-asserted-by":"crossref","unstructured":"Mengfan Liu Wei Wang and Chuan Wu. 2025. Optimizing distributed deployment of mixture-of-experts model inference in serverless computing. IEEE Infocom 2025-IEEE Conference on Computer Communications. 1\u201310.","DOI":"10.1109\/INFOCOM55648.2025.11044553"},{"key":"e_1_3_1_111_2","doi-asserted-by":"crossref","unstructured":"Qidong Liu Xian Wu Xiangyu Zhao Yuanshao Zhu Derong Xu Feng Tian and Yefeng Zheng. 2024. When MOE meets LLMs: Parameter efficient fine-tuning for multi-task medical applications. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1104\u20131114.","DOI":"10.1145\/3626772.3657722"},{"key":"e_1_3_1_112_2","first-page":"13782","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Liu Rui","year":"2022","unstructured":"Rui Liu, Young Jin Kim, Alexandre Muzio, and Hany Hassan. 2022. Gating dropout: Communication-efficient regularization for sparsely activated transformers. In Proceedings of the International Conference on Machine Learning. PMLR, 13782\u201313792."},{"key":"e_1_3_1_113_2","unstructured":"Zefang Liu and Jiahua Luo. 2024. AdaMoLE: Fine-tuning large language models with adaptive mixture of low-rank adaptation experts. First Conference on Language Modeling. https:\/\/openreview.net\/forum?id=ndY9qFf9Sa"},{"key":"e_1_3_1_114_2","first-page":"22631","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Longpre Shayne","year":"2023","unstructured":"Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, et\u00a0al. 2023. The flan collection: Designing data and methods for effective instruction tuning. In Proceedings of the International Conference on Machine Learning. 
PMLR, 22631\u201322648."},{"key":"e_1_3_1_115_2","doi-asserted-by":"crossref","unstructured":"Xudong Lu Qi Liu Yuhui Xu Aojun Zhou Siyuan Huang Bo Zhang Junchi Yan and Hongsheng Li. 2024. Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics 1 (2024) 6159\u20136172.","DOI":"10.18653\/v1\/2024.acl-long.334"},{"key":"e_1_3_1_116_2","unstructured":"Tongxu Luo Jiahe Lei Fangyu Lei Weihao Liu Shizhu He Jun Zhao and Kang Liu. 2024. Moelora: Contrastive learning guided mixture of experts on parameter-efficient fine-tuning for large language models. arXiv:2402.12851. Retrieved from https:\/\/arxiv.org\/abs\/2402.12851 (2024)."},{"key":"e_1_3_1_117_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503221.3508417"},{"key":"e_1_3_1_118_2","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2024.3437365"},{"key":"e_1_3_1_119_2","doi-asserted-by":"crossref","unstructured":"Xin Men Mingyu Xu Qingyu Zhang Bingning Wang Hongyu Lin Yaojie Lu Xianpei Han and Weipeng Chen. 2025. Shortgpt: Layers in large language models are more redundant than you expect. In Findings of the Association for Computational Linguistics: ACL 2025. 20192\u201320204.","DOI":"10.18653\/v1\/2025.findings-acl.1035"},{"key":"e_1_3_1_120_2","unstructured":"Meta AI. 2025. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. Retrieved October 20 2025 from https:\/\/ai.meta.com\/blog\/llama-4-multimodal-intelligence\/"},{"key":"e_1_3_1_121_2","unstructured":"Microsoft. 2023. DeepSpeed. Retrieved from https:\/\/github.com\/microsoft\/DeepSpeed"},{"key":"e_1_3_1_122_2","unstructured":"MiniMax Aonian Li Bangwei Gong Bo Yang Boji Shan Chang Liu Cheng Zhu Chunhao Zhang Congchao Guo Da Chen et\u00a0al. 2025. MiniMax-01: Scaling foundation models with lightning attention. arXiv:2501.08313. 
Retrieved from https:\/\/arxiv.org\/abs\/2501.08313"},{"key":"e_1_3_1_123_2","unstructured":"Niklas Muennighoff Luca Soldaini Dirk Groeneveld Kyle Lo Jacob Morrison Sewon Min Weijia Shi Pete Walsh Oyvind Tafjord Nathan Lambert et\u00a0al. 2024. OLMoE: Open mixture-of-experts language models. arXiv:2409.02060. Retrieved from https:\/\/arxiv.org\/abs\/2409.02060 (2024)."},{"key":"e_1_3_1_124_2","unstructured":"Alexandre Muzio Alex Sun and Churan He. 2024. SEER-MoE: Sparse expert efficiency through regularization for mixture-of-experts. arXiv:2404.05089. Retrieved from https:\/\/arxiv.org\/abs\/2404.05089 (2024)."},{"key":"e_1_3_1_125_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_3_1_126_2","unstructured":"Nam V. Nguyen Thong T. Doan Luong Tran Van Nguyen and Quang Pham. 2024. LIBMoE: A library for comprehensive benchmarking mixture of experts in large language models. arXiv:2411.00918. Retrieved from https:\/\/arxiv.org\/abs\/2411.00918 (2024)."},{"key":"e_1_3_1_127_2","doi-asserted-by":"publisher","DOI":"10.1145\/3588964"},{"key":"e_1_3_1_128_2","unstructured":"Xiaonan Nie Pinxue Zhao Xupeng Miao Tong Zhao and Bin Cui. 2022. HetuMoE: An efficient trillion-scale mixture-of-expert distributed training system. arXiv:2203.14685. Retrieved from https:\/\/arxiv.org\/abs\/2203.14685 (2022)."},{"key":"e_1_3_1_129_2","doi-asserted-by":"publisher","DOI":"10.1109\/JLT.2024.3427716"},{"key":"e_1_3_1_130_2","unstructured":"NVIDIA. 2019. FasterTransformer. Retrieved from https:\/\/github.com\/NVIDIA\/FasterTransformer"},{"key":"e_1_3_1_131_2","unstructured":"OpenAI. 2023. Gpt-4 technical report. arXiv:2303.08774. 
Retrieved from https:\/\/arxiv.org\/abs\/2303.08774 (2023)."},{"key":"e_1_3_1_132_2","first-page":"27730","article-title":"Training language models to follow instructions with human feedback","volume":"35","author":"Ouyang Long","year":"2022","unstructured":"Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et\u00a0al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730\u201327744.","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"1","key":"e_1_3_1_133_2","first-page":"14","article-title":"Quantum computing and AI in the cloud","volume":"4","author":"Padmanaban Harish","year":"2024","unstructured":"Harish Padmanaban. 2024. Quantum computing and AI in the cloud. Journal of Computational Intelligence and Robotics 4, 1 (2024), 14\u201332.","journal-title":"Journal of Computational Intelligence and Robotics"},{"key":"e_1_3_1_134_2","unstructured":"Bowen Pan Yikang Shen Haokun Liu Mayank Mishra Gaoyuan Zhang Aude Oliva Colin Raffel and Rameswar Panda. 2024. Dense training sparse inference: Rethinking training of mixture-of-experts language models. arXiv:2404.05567. Retrieved from https:\/\/arxiv.org\/abs\/2404.05567 (2024)."},{"key":"e_1_3_1_135_2","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM52122.2024.10621327"},{"key":"e_1_3_1_136_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC49657.2024.10454487"},{"key":"e_1_3_1_137_2","unstructured":"Sejik Park. 2024. Learning more generalized experts by merging experts in mixture-of-experts. arXiv:2405.11530. Retrieved from https:\/\/arxiv.org\/abs\/2405.11530 (2024)."},{"key":"e_1_3_1_138_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.587"},{"key":"e_1_3_1_139_2","unstructured":"Joan Puigcerver Carlos Riquelme Basil Mustafa and Neil Houlsby. 2023. 
From sparse to soft mixtures of experts. arXiv:2308.00951. Retrieved from https:\/\/arxiv.org\/abs\/2308.00951 (2023)."},{"key":"e_1_3_1_140_2","unstructured":"Yulei Qian Fengcun Li Xiangyang Ji Xiaoyu Zhao Jianchao Tan Kefeng Zhang and Xunliang Cai. 2024. EPS-MoE: Expert pipeline scheduler for cost-efficient MoE inference. arXiv:2410.12247. Retrieved from https:\/\/arxiv.org\/abs\/2410.12247 (2024)."},{"key":"e_1_3_1_141_2","doi-asserted-by":"crossref","unstructured":"Zihan Qiu Zeyu Huang Bo Zheng Kaiyue Wen Zekun Wang Rui Men Ivan Titov Dayiheng Liu Jingren Zhou and Junyang Lin. 2025. Demons in the detail: On implementing load balancing loss for training specialized mixture-of-expert models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics 1 (2025) 5005\u20135018. https:\/\/aclanthology.org\/2025.acl-long.249\/","DOI":"10.18653\/v1\/2025.acl-long.249"},{"key":"e_1_3_1_142_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et\u00a0al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 8748\u20138763."},{"key":"e_1_3_1_143_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPWRS.2022.3173250"},{"issue":"140","key":"e_1_3_1_144_2","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21, 140 (2020), 1\u201367.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_1_145_2","series-title":"Proceedings of Machine Learning Research","first-page":"18332","volume-title":"Proceedings of the 39th International Conference on Machine Learning","volume":"162","author":"Rajbhandari Samyam","year":"2022","unstructured":"Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR, 18332\u201318346. Retrieved from https:\/\/proceedings.mlr.press\/v162\/rajbhandari22a.html"},{"key":"e_1_3_1_146_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_3_1_147_2","doi-asserted-by":"crossref","unstructured":"P Rajpurkar. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics 2383\u20132392. https:\/\/aclanthology.org\/D16-1264\/","DOI":"10.18653\/v1\/D16-1264"},{"key":"e_1_3_1_148_2","doi-asserted-by":"publisher","DOI":"10.1186\/s40537-024-00920-x"},{"key":"e_1_3_1_149_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA57654.2024.00066"},{"key":"e_1_3_1_150_2","unstructured":"Facebook AI Research. 2019. Fairseq. Retrieved from https:\/\/github.com\/facebookresearch\/fairseq"},{"key":"e_1_3_1_151_2","unstructured":"Snowflake AI Research. 2024. Snowflake Arctic: The Best LLM for Enterprise AI \u2013 Efficiently Intelligent Truly Open. 
Retrieved from https:\/\/www.snowflake.com\/en\/blog\/arctic-open-efficient-foundation-language-models-snowflake\/"},{"key":"e_1_3_1_152_2","volume-title":"Proceedings of the 2011 AAAI spring symposium series","author":"Roemmele Melissa","year":"2011","unstructured":"Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Proceedings of the 2011 AAAI spring symposium series."},{"key":"e_1_3_1_153_2","first-page":"17555","article-title":"Hash layers for large sparse models","volume":"34","author":"Roller Stephen","year":"2021","unstructured":"Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason Weston. 2021. Hash layers for large sparse models. Advances in Neural Information Processing Systems 34 (2021), 17555\u201317566.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_154_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_1_155_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474381"},{"key":"e_1_3_1_156_2","volume-title":"Knowledge distillation for mixture of experts models in speech recognition","author":"Salinas Felipe Cruz","year":"2022","unstructured":"Felipe Cruz Salinas, Kenichi Kumatani, Robert Gmyr, Linquan Liu, and Yu Shi. 2022. Knowledge distillation for mixture of experts models in speech recognition. Technical Report MSR-TR-2022-6, Microsoft Research, May 2022. Retrieved from https:\/\/www\u2026"},{"key":"e_1_3_1_157_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD57390.2023.10323651"},{"key":"e_1_3_1_158_2","unstructured":"Soumajyoti Sarkar Leonard Lausen Volkan Cevher Sheng Zha Thomas Brox and George Karypis. 2024. Revisiting SMoE language models by evaluating inefficiencies with task specific expert pruning. arXiv:2409.01483. 
Retrieved from https:\/\/arxiv.org\/abs\/2409.01483 (2024)."},{"key":"e_1_3_1_159_2","doi-asserted-by":"publisher","DOI":"10.1038\/s43588-021-00184-y"},{"key":"e_1_3_1_160_2","unstructured":"Catherine D. Schuman Thomas E. Potok Robert M. Patton J. Douglas Birdwell Mark E. Dean Garrett S. Rose and James S. Plank. 2017. A survey of neuromorphic computing and neural networks in hardware. arXiv:1705.06963. Retrieved from https:\/\/arxiv.org\/abs\/1705.06963 (2017)."},{"key":"e_1_3_1_161_2","unstructured":"Noam Shazeer Azalia Mirhoseini Krzysztof Maziarz Andy Davis Quoc Le Geoffrey Hinton and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=B1ckMDqlg"},{"key":"e_1_3_1_162_2","unstructured":"Liang Shen Zhihua Wu WeiBao Gong Hongxiang Hao Yangfan Bai HuaChao Wu Xinxuan Wu Jiang Bian Haoyi Xiong Dianhai Yu et\u00a0al. 2022. Se-moe: A scalable and efficient mixture-of-experts distributed training and inference system. arXiv:2205.10034. Retrieved from https:\/\/arxiv.org\/abs\/2205.10034 (2022)."},{"key":"e_1_3_1_163_2","unstructured":"Yikang Shen Zhen Guo Tianle Cai and Zengyi Qin. 2024. Jetmoe: Reaching llama2 performance with 0.1 m dollars. arXiv:2404.07413. Retrieved from https:\/\/arxiv.org\/abs\/2404.07413 (2024)."},{"key":"e_1_3_1_164_2","unstructured":"Yikang Shen Zheyu Zhang Tianyou Cao Shawn Tan Zhenfang Chen and Chuang Gan. 2023. Moduleformer: Learning modular large language models from uncurated data. arXiv:2306.04640. Retrieved from https:\/\/arxiv.org\/abs\/2306.04640 (2023)."},{"key":"e_1_3_1_165_2","first-page":"31094","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Sheng Ying","year":"2023","unstructured":"Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher R\u00e9, Ion Stoica, and Ce Zhang. 2023. 
Flexgen: High-throughput generative inference of large language models with a single gpu. In Proceedings of the International Conference on Machine Learning. PMLR, 31094\u201331116."},{"key":"e_1_3_1_166_2","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM53939.2023.10228874"},{"key":"e_1_3_1_167_2","doi-asserted-by":"publisher","DOI":"10.1145\/3627703.3650083"},{"key":"e_1_3_1_168_2","unstructured":"Fangxun Shu Yue Liao Le Zhuo Chenning Xu Lei Zhang Guanghao Zhang Haonan Shi Long Chen Tao Zhong Wanggui He et\u00a0al. 2025. Llava-mod: Making llava tiny via moe knowledge distillation. The Thirteenth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=uWtLOy35WD"},{"key":"e_1_3_1_169_2","doi-asserted-by":"publisher","DOI":"10.1145\/3577193.3593704"},{"key":"e_1_3_1_170_2","unstructured":"Andrii Skliar Ties van Rozendaal Romain Lepert Todor Boinovski Mart van Baalen Markus Nagel Paul Whatmough and Babak Ehteshami Bejnordi. 2025. Mixture of cache-conditional experts for efficient mobile device inference. Transactions on Machine Learning Research. https:\/\/openreview.net\/forum?id=ul4W26KEKz"},{"key":"e_1_3_1_171_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D13-1170"},{"key":"e_1_3_1_172_2","unstructured":"Xiaoniu Song Zihang Zhong and Rong Chen. 2024. ProMoE: Fast MoE-based LLM serving using proactive caching. arXiv:2410.22134. Retrieved from https:\/\/arxiv.org\/abs\/2410.22134 (2024)."},{"key":"e_1_3_1_173_2","doi-asserted-by":"publisher","DOI":"10.1088\/0034-4885\/61\/2\/002"},{"key":"e_1_3_1_174_2","unstructured":"Sainbayar Sukhbaatar Olga Golovneva Vasu Sharma Hu Xu Xi Victoria Lin Baptiste Rozi\u00e8re Jacob Kahn Daniel Li Wen-tau Yih Jason Weston et\u00a0al. 2024. Branch-Train-MiX: Mixing expert LLMs into a mixture-of-experts LLM. First Conference on Language Modeling. 
https:\/\/openreview.net\/forum?id=nqLAuMOF6n"},{"key":"e_1_3_1_175_2","unstructured":"Mengshu Sun Haoyu Ma Guoliang Kang Yifan Jiang Tianlong Chen Xiaolong Ma Zhangyang Wang and Yanzhi Wang. 2022. Vaqf: Fully automatic software-hardware co-design framework for low-bit vision transformer. arXiv:2201.06618. Retrieved from https:\/\/arxiv.org\/abs\/2201.06618 (2022)."},{"key":"e_1_3_1_176_2","unstructured":"Xingwu Sun Yanfeng Chen Yiqing Huang Ruobing Xie Jiaqi Zhu Kai Zhang Shuaipeng Li Zhen Yang Jonny Han Xiaobo Shu et\u00a0al. 2024. Hunyuan-large: An open-source MoE model with 52 Billion activated parameters by Tencent. arXiv:2411.02265. Retrieved from https:\/\/arxiv.org\/abs\/2411.02265 (2024)."},{"key":"e_1_3_1_177_2","doi-asserted-by":"crossref","unstructured":"Shawn Tan Yikang Shen Zhenfang Chen Aaron Courville and Chuang Gan. 2023. Sparse universal transformer. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 169\u2013179.","DOI":"10.18653\/v1\/2023.emnlp-main.12"},{"key":"e_1_3_1_178_2","unstructured":"Peng Tang Jiacheng Liu Xiaofeng Hou Yifei Pu Jing Wang Pheng-Ann Heng Chao Li and Minyi Guo. 2024. HOBBIT: A mixed precision expert offloading system for fast MoE inference. arXiv:2411.01433. Retrieved from https:\/\/arxiv.org\/abs\/2411.01433 (2024)."},{"key":"e_1_3_1_179_2","unstructured":"Rohan Taori Ishaan Gulrajani Tianyi Zhang Yann Dubois Xuechen Li Carlos Guestrin Percy Liang and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. (2023)."},{"key":"e_1_3_1_180_2","unstructured":"Gemini Team Petko Georgiev Ving Ian Lei Ryan Burnell Libin Bai Anmol Gulati Garrett Tanzer Damien Vincent Zhufeng Pan Shibo Wang et\u00a0al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530. Retrieved from https:\/\/arxiv.org\/abs\/2403.05530 (2024)."},{"key":"e_1_3_1_181_2","unstructured":"Google Brain Team. 2024. TensorFlow. 
Retrieved from http:\/\/tensorflow.org\/"},{"key":"e_1_3_1_182_2","unstructured":"Meituan LongCat Team Bei Li Bingye Lei Bo Wang Bolin Rong Chao Wang Chao Zhang Chen Gao Chen Zhang Cheng Sun et\u00a0al. 2025. LongCat-flash technical report. arXiv:2509.01322. Retrieved from https:\/\/arxiv.org\/abs\/2509.01322 (2025)."},{"key":"e_1_3_1_183_2","unstructured":"Qwen Team. 2024. Qwen1.5-MoE: Matching 7B Model Performance with 1\/3 Activated Parameters. Retrieved from https:\/\/qwenlm.github.io\/blog\/qwen-moe\/"},{"key":"e_1_3_1_184_2","unstructured":"The Mosaic Research Team. 2024. Introducing DBRX: A New State-of-the-Art Open LLM. Retrieved from https:\/\/www.databricks.com\/blog\/introducing-dbrx-new-state-art-open-llm"},{"key":"e_1_3_1_185_2","unstructured":"Tencent Hunyuan Team. 2025. Hunyuan-A13B Technical Report. Retrieved October 20 2025 from https:\/\/github.com\/Tencent-Hunyuan\/Hunyuan-A13B\/blob\/main\/report\/Hunyuan_A13B_Technical_Report.pdf"},{"key":"e_1_3_1_186_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"13","author":"Tresp Volker","year":"2000","unstructured":"Volker Tresp. 2000. Mixtures of gaussian processes. In Proceedings of the Advances in Neural Information Processing Systems, T. Leen, T. Dietterich, and V. Tresp (Eds.), Vol. 13. MIT Press."},{"key":"e_1_3_1_187_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00577"},{"key":"e_1_3_1_188_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)."},{"key":"e_1_3_1_189_2","doi-asserted-by":"publisher","DOI":"10.20944\/preprints202408.0583.v2"},{"key":"e_1_3_1_190_2","unstructured":"Zhongwei Wan Xin Wang Che Liu Samiul Alam Yu Zheng Jiachen Liu Zhongnan Qu Shen Yan Yi Zhu Quanlu Zhang et\u00a0al. 2024. Efficient large language models: A survey. 
Transactions on Machine Learning Research. https:\/\/openreview.net\/forum?id=bsCCJHbO8A"},{"key":"e_1_3_1_191_2","doi-asserted-by":"crossref","unstructured":"Alex Wang. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. 353\u2013355.","DOI":"10.18653\/v1\/W18-5446"},{"key":"e_1_3_1_192_2","unstructured":"Hongyu Wang Shuming Ma Li Dong Shaohan Huang Huaijie Wang Lingxiao Ma Fan Yang Ruiping Wang Yi Wu and Furu Wei. 2023. Bitnet: Scaling 1-bit transformers for large language models. arXiv:2310.11453. Retrieved from https:\/\/arxiv.org\/abs\/2310.11453 (2023)."},{"key":"e_1_3_1_193_2","unstructured":"Hongyi Wang Felipe Maia Polo Yuekai Sun Souvik Kundu Eric Xing and Mikhail Yurochkin. 2024. Fusing models with complementary expertise. The Twelfth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=PhMrGCMIRL"},{"key":"e_1_3_1_194_2","first-page":"23318","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Wang Peng","year":"2022","unstructured":"Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In Proceedings of the International Conference on Machine Learning. PMLR, 23318\u201323340."},{"key":"e_1_3_1_195_2","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER52292.2023.00015"},{"key":"e_1_3_1_196_2","first-page":"552","volume-title":"Proceedings of the Uncertainty in Artificial Intelligence","author":"Wang Xin","year":"2020","unstructured":"Xin Wang, Fisher Yu, Lisa Dunlap, Yi-An Ma, Ruth Wang, Azalia Mirhoseini, Trevor Darrell, and Joseph E. Gonzalez. 2020. Deep mixture of experts via shallow embedding. In Proceedings of the Uncertainty in Artificial Intelligence. 
PMLR, 552\u2013562."},{"key":"e_1_3_1_197_2","unstructured":"Xin Wang Yu Zheng Zhongwei Wan and Mi Zhang. 2025. Svd-llm: Truncation-aware singular value decomposition for large language model compression. The Thirteenth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=LNYIUouhdt"},{"key":"e_1_3_1_198_2","unstructured":"Tianwen Wei Bo Zhu Liang Zhao Cheng Cheng Biye Li Weiwei L\u00fc Peng Cheng Jianhao Zhang Xiaoyu Zhang Liang Zeng et\u00a0al. 2024. Skywork-MoE: A deep dive into training techniques for mixture-of-experts language models. arXiv:2406.06563. Retrieved from https:\/\/arxiv.org\/abs\/2406.06563 (2024)."},{"key":"e_1_3_1_199_2","unstructured":"Shaohua Wu Jiangang Luo Xi Chen Lingjun Li Xudong Zhao Tong Yu Chao Wang Yue Wang Fei Wang Weixu Qiao et\u00a0al. 2024. Yuan 2.0-M32: Mixture of experts with attention router. arXiv:2405.17976. Retrieved from https:\/\/arxiv.org\/abs\/2405.17976 (2024)."},{"key":"e_1_3_1_200_2","unstructured":"Yongji Wu Wenjie Qu Tianyang Tao Zhuang Wang Wei Bai Zhuohao Li Yuan Tian Jiaheng Zhang Matthew Lentz and Danyang Zhuo. 2024. Lazarus: Resilient and elastic training of mixture-of-experts models with adaptive expert placement. arXiv:2407.04656. Retrieved from https:\/\/arxiv.org\/abs\/2407.04656 (2024)."},{"key":"e_1_3_1_201_2","unstructured":"xAI. 2024. Open Release of Grok-1. Retrieved from https:\/\/x.ai\/blog\/grok-os"},{"key":"e_1_3_1_202_2","unstructured":"Xinfeng Xia Jiacheng Liu Xiaofeng Hou Peng Tang Mingxuan Zhang Wenfeng Wang and Chao Li. 2025. MoE-Prism: Disentangling monolithic experts for elastic MoE services via model-system co-designs. arXiv:2510.19366. Retrieved from https:\/\/arxiv.org\/abs\/2510.19366 (2025)."},{"key":"e_1_3_1_203_2","unstructured":"Yanyue Xie Zhi Zhang Ding Zhou Cong Xie Ziang Song Xin Liu Yanzhi Wang Xue Lin and An Xu. 2024. MoE-pruner: Pruning mixture-of-experts large language model using the hints from its router. arXiv:2410.12013. 
Retrieved from https:\/\/arxiv.org\/abs\/2410.12013 (2024)."},{"key":"e_1_3_1_204_2","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"Zeng Deyi Xiong and Zhiyuan","unstructured":"Deyi Xiong and Zhiyuan Zeng. 2023. SCoMoE: Efficient mixtures of experts with structured communication. In Proceedings of the 11th International Conference on Learning Representations."},{"key":"e_1_3_1_205_2","unstructured":"Mengwei Xu Wangsong Yin Dongqi Cai Rongjie Yi Daliang Xu Qipeng Wang Bingyang Wu Yihao Zhao Chen Yang Shihe Wang et\u00a0al. 2024. A survey of resource-efficient llm and multimodal foundation models. arXiv:2401.08092. Retrieved from https:\/\/arxiv.org\/abs\/2401.08092 (2024)."},{"key":"e_1_3_1_206_2","unstructured":"Fuzhao Xue Xiaoxin He Xiaozhe Ren Yuxuan Lou and Yang You. 2022. One student knows all experts know: From sparse to dense. arXiv:2201.10890. Retrieved from https:\/\/arxiv.org\/abs\/2201.10890 (2022)."},{"key":"e_1_3_1_207_2","unstructured":"Fuzhao Xue Zian Zheng Yao Fu Jinjie Ni Zangwei Zheng Wangchunshu Zhou and Yang You. 2024. Openmoe: An early effort on open mixture-of-experts language models. In Proceedings of the 41st International Conference on Machine Learning. 55625\u201355655."},{"key":"e_1_3_1_208_2","unstructured":"Leyang Xue Yao Fu Zhan Lu Luo Mai and Mahesh Marina. 2024. Moe-infinity: Activation-aware expert offloading for efficient moe serving. arXiv:2401.14361. Retrieved from https:\/\/arxiv.org\/abs\/2401.14361 (2024)."},{"key":"e_1_3_1_209_2","unstructured":"An Yang Anfeng Li Baosong Yang Beichen Zhang Binyuan Hui Bo Zheng Bowen Yu Chang Gao Chengen Huang Chenxu Lv et\u00a0al. 2025. Qwen3 technical report. arXiv:2505.09388. Retrieved from https:\/\/arxiv.org\/abs\/2505.09388 (2025)."},{"key":"e_1_3_1_210_2","unstructured":"An Yang Baosong Yang Binyuan Hui Bo Zheng Bowen Yu Chang Zhou Chengpeng Li Chengyuan Li Dayiheng Liu Fei Huang et\u00a0al. 2024. Qwen2 technical report. 
arXiv:2407.10671. Retrieved from https:\/\/arxiv.org\/abs\/2407.10671 (2024)."},{"key":"e_1_3_1_211_2","doi-asserted-by":"crossref","unstructured":"Cheng Yang Yang Sui Jinqi Xiao Lingyi Huang Yu Gong Yuanlin Duan Wenqi Jia Miao Yin Yu Cheng and Bo Yuan. 2024. MoE-I2: Compressing mixture of experts models through inter-expert pruning and intra-expert low-rank decomposition. Findings of the Association for Computational Linguistics: EMNLP 2024. 10456\u201310466.","DOI":"10.18653\/v1\/2024.findings-emnlp.612"},{"key":"e_1_3_1_212_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-acl.694"},{"key":"e_1_3_1_213_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1237"},{"key":"e_1_3_1_214_2","doi-asserted-by":"crossref","unstructured":"Zhilin Yang Peng Qi Saizheng Zhang Yoshua Bengio William W. Cohen Ruslan Salakhutdinov and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2369\u20132380.","DOI":"10.18653\/v1\/D18-1259"},{"key":"e_1_3_1_215_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS57955.2024.00086"},{"key":"e_1_3_1_216_2","doi-asserted-by":"crossref","unstructured":"Rongjie Yi Liwei Guo Shiyun Wei Ao Zhou Shangguang Wang and Mengwei Xu. 2025. EdgeMoE: Empowering sparse large language models on mobile devices. IEEE Transactions on Mobile Computing 24 8 (2025) 7059\u20137073.","DOI":"10.1109\/TMC.2025.3546466"},{"key":"e_1_3_1_217_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA56546.2023.10071027"},{"key":"e_1_3_1_218_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSC.2024.3399654"},{"key":"e_1_3_1_219_2","unstructured":"Longhui Yu Weisen Jiang Han Shi Jincheng Yu Zhengying Liu Yu Zhang James T. Kwok Zhenguo Li Adrian Weller and Weiyang Liu. 2024. MetaMath: Bootstrap your own mathematical questions for large language models.
The Twelfth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=N8N0hgNDRt"},{"key":"e_1_3_1_220_2","doi-asserted-by":"publisher","DOI":"10.3390\/electronics13112077"},{"key":"e_1_3_1_221_2","doi-asserted-by":"crossref","unstructured":"Yuping Yuan Zhao You Shulin Feng Dan Su Yanchun Liang Xiaohu Shi and Dong Yu. 2023. Compressed MoE ASR model based on knowledge distillation and quantization. INTERSPEECH. 3337\u20133341.","DOI":"10.21437\/Interspeech.2023-2544"},{"key":"e_1_3_1_222_2","unstructured":"Zhihang Yuan Yuzhang Shang Yue Song Qiang Wu Yan Yan and Guangyu Sun. 2023. ASVD: Activation-aware singular value decomposition for compressing large language models. arXiv:2312.05821. Retrieved from https:\/\/arxiv.org\/abs\/2312.05821 (2023)."},{"key":"e_1_3_1_223_2","unstructured":"Zhihang Yuan Yuzhang Shang Yang Zhou Zhen Dong Zhe Zhou Chenhao Xue Bingzhe Wu Zhikai Li Qingyi Gu Yong Jae Lee et\u00a0al. 2024. LLM inference unveiled: Survey and roofline model insights. arXiv:2402.16363. Retrieved from https:\/\/arxiv.org\/abs\/2402.16363 (2024)."},{"key":"e_1_3_1_224_2","doi-asserted-by":"crossref","unstructured":"Sungmin Yun Kwanhee Kyung Juhwan Cho Jaewan Choi Jongmin Kim Byeongho Kim Sukhan Lee Kyomin Sohn and Jung Ho Ahn. 2024. Duplex: A device for large language models with mixture of experts, grouped query attention, and continuous batching. 2024 57th IEEE\/ACM International Symposium on Microarchitecture (MICRO). 1429\u20131443.","DOI":"10.1109\/MICRO61859.2024.00105"},{"key":"e_1_3_1_225_2","unstructured":"Aohan Zeng Xin Lv Qinkai Zheng Zhenyu Hou Bin Chen Chengxing Xie Cunxiang Wang Da Yin Hao Zeng Jiajie Zhang et\u00a0al. 2025. GLM-4.5: Agentic reasoning and coding (ARC) foundation models. arXiv:2508.06471.
Retrieved from https:\/\/arxiv.org\/abs\/2508.06471 (2025)."},{"key":"e_1_3_1_226_2","first-page":"961","volume-title":"Proceedings of the 2023 USENIX Annual Technical Conference (USENIX ATC 23)","author":"Zhai Mingshu","year":"2023","unstructured":"Mingshu Zhai, Jiaao He, Zixuan Ma, Zan Zong, Runqing Zhang, and Jidong Zhai. 2023. SmartMoE: Efficiently training sparsely-activated models through combining offline and online parallelization. In Proceedings of the 2023 USENIX Annual Technical Conference (USENIX ATC 23). USENIX Association, Boston, MA, 961\u2013975. Retrieved from https:\/\/www.usenix.org\/conference\/atc23\/presentation\/zhai"},{"key":"e_1_3_1_227_2","doi-asserted-by":"crossref","unstructured":"Qizhen Zhang Nikolas Gritsch Dwaraknath Gnaneshwar Simon Guo David Cairuz Bharat Venkitesh Jakob Foerster Phil Blunsom Sebastian Ruder Ahmet Ustun et\u00a0al. 2024. BAM! just like that: Simple and efficient parameter upcycling for mixture of experts. Advances in Neural Information Processing Systems 37 (2024) 56304\u201356321.","DOI":"10.52202\/079017-1792"},{"key":"e_1_3_1_228_2","doi-asserted-by":"publisher","DOI":"10.1145\/3542929.3563487"},{"key":"e_1_3_1_229_2","doi-asserted-by":"crossref","unstructured":"Xiaofeng Zhang Yikang Shen Zeyu Huang Jie Zhou Wenge Rong and Zhang Xiong. 2022. Mixture of attention heads: Selecting attention heads per token. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 4150\u20134162.","DOI":"10.18653\/v1\/2022.emnlp-main.278"},{"key":"e_1_3_1_230_2","doi-asserted-by":"crossref","unstructured":"Zeliang Zhang Xiaodong Liu Hao Cheng Chenliang Xu and Jianfeng Gao. 2025. Diversifying the expert knowledge for task-agnostic pruning in sparse mixture-of-experts. Findings of the Association for Computational Linguistics: ACL 2025. 
86\u2013102.","DOI":"10.18653\/v1\/2025.findings-acl.4"},{"key":"e_1_3_1_231_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW54120.2021.00314"},{"key":"e_1_3_1_232_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2024.3385639"},{"key":"e_1_3_1_233_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.571"},{"key":"e_1_3_1_234_2","first-page":"559","volume-title":"Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, et\u00a0al. 2022. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 559\u2013578. Retrieved from https:\/\/www.usenix.org\/conference\/osdi22\/presentation\/zheng-lianmin"},{"key":"e_1_3_1_235_2","doi-asserted-by":"crossref","unstructured":"Shuzhang Zhong Ling Liang Yuan Wang Runsheng Wang Ru Huang and Meng Li. 2024. AdapMoE: Adaptive sensitivity-based expert gating and management for efficient MoE inference. In Proceedings of the 43rd IEEE\/ACM International Conference on Computer-Aided Design. 1\u20139.","DOI":"10.1145\/3676536.3676741"},{"key":"e_1_3_1_236_2","article-title":"LIMA: Less is more for alignment","volume":"36","author":"Zhou Chunting","year":"2024","unstructured":"Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et\u00a0al. 2024. LIMA: Less is more for alignment.
Advances in Neural Information Processing Systems 36 (2024), 55006\u201355021.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_237_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA53966.2022.00082"},{"key":"e_1_3_1_238_2","first-page":"7103","article-title":"Mixture-of-experts with expert choice routing","volume":"35","author":"Zhou Yanqi","year":"2022","unstructured":"Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Quoc V. Le, and James Laudon. 2022. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems 35 (2022), 7103\u20137114.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_239_2","first-page":"7103","article-title":"Mixture-of-experts with expert choice routing","volume":"35","author":"Zhou Yanqi","year":"2022","unstructured":"Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Quoc V. Le, James Laudon, Zhifeng Chen. 2022. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems 35 (2022), 7103\u20137114.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_240_2","unstructured":"Zixuan Zhou Xuefei Ning Ke Hong Tianyu Fu Jiaming Xu Shiyao Li Yuming Lou Luning Wang Zhihang Yuan Xiuhong Li et\u00a0al. 2024. A survey on efficient inference for large language models. arXiv:2404.14294. 
Retrieved from https:\/\/arxiv.org\/abs\/2404.14294 (2024)."},{"key":"e_1_3_1_241_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA57654.2024.00059"},{"key":"e_1_3_1_242_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.emnlp-main.890"},{"key":"e_1_3_1_243_2","doi-asserted-by":"publisher","DOI":"10.1145\/3666025.3699355"}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3794845","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T20:44:04Z","timestamp":1775853844000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3794845"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,9]]},"references-count":242,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2026,7,31]]}},"alternative-id":["10.1145\/3794845"],"URL":"https:\/\/doi.org\/10.1145\/3794845","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,9]]},"assertion":[{"value":"2025-01-21","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-01-08","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-03-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}